Kinase-substrate prediction using phosphoproteomics datasets

Problem Statement: Humphrey et al. (2013) conducted time-course phosphoproteome profiling of insulin-stimulated fat cells (3T3-L1). The resulting data can be used to uncover unknown aspects of the insulin signalling pathway, which is important for the treatment of type II diabetes. Akt and mTOR are believed to be central kinases in insulin signalling in fat cells. Kinases regulate their substrates by recognising substrate peptide sequence motifs, and substrates of the same kinase often have similar response profiles. This similarity in response profiles can be leveraged to predict novel substrates of Akt and mTOR by learning features from the temporal phosphoproteomics data.

Aim: Prediction is a central application in many real-world data analyses. In this project, we aim to apply classification techniques to predict novel kinase substrates.

Objective: Design a predictive model for identifying novel Akt and mTOR substrates using the InsulinPhospho.txt, AUC_Ins.txt, Akt_substrates.txt and mTOR_substrates.txt datasets.

1. Import all required libraries for analysis

library(ggplot2)
library(mlbench)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(caret)
## Loading required package: lattice
library(scatterplot3d)
library(MASS)
library(gbm)
## Loaded gbm 2.1.4
library(class)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(Hmisc)
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## The following object is masked from 'package:base':
## 
##     Recall
library(mice)
## 
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(tidyverse)
## -- Attaching packages ------------------------------------------------------- tidyverse 1.2.1 --
## v tibble  1.4.2     v purrr   0.2.5
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0
## -- Conflicts ---------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::combine()       masks randomForest::combine()
## x tidyr::complete()      masks mice::complete()
## x dplyr::filter()        masks stats::filter()
## x dplyr::lag()           masks stats::lag()
## x purrr::lift()          masks caret::lift()
## x randomForest::margin() masks ggplot2::margin()
## x dplyr::select()        masks MASS::select()
## x xgboost::slice()       masks dplyr::slice()
## x Hmisc::src()           masks dplyr::src()
## x Hmisc::summarize()     masks dplyr::summarize()
library(corrplot)
## corrplot 0.84 loaded
library(e1071)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
## 
##     impute
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(stringr)
library(reshape)
## 
## Attaching package: 'reshape'
## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths
## The following object is masked from 'package:dplyr':
## 
##     rename
## The following object is masked from 'package:class':
## 
##     condense
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
## 
##     some
## The following object is masked from 'package:dplyr':
## 
##     recode
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:reshape':
## 
##     rename
## The following object is masked from 'package:xgboost':
## 
##     slice
## The following object is masked from 'package:Hmisc':
## 
##     subplot
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
source("3_RebuiltAdasamplingFunctions.R")
source("functions.R")
source("3_RebuiltAdasamplingFunctions_SVM.R")

2. Import required datasets for analysis.

InsulinPhospho <- read.csv("./Data2/Datasets/InsulinPhospho.txt", sep="\t") # Main temporal phosphoproteomics data.
AUC_Ins <- read.csv("./Data2/Datasets/AUC_Ins.txt", sep="\t") # Secondary temporal phosphoproteomics data.

Akt_Substrates <- read.csv("./Data2/Datasets/Akt_substrates.txt", header = F) # Identifiers of known Akt substrates.
mTOR_Substrates <- read.csv("./Data2/Datasets/mTOR_substrates.txt", header = F) # Identifiers of known mTOR substrates.

# Name the column of Akt and mTOR dataframes.
names(Akt_Substrates) <- "Identifier"
names(mTOR_Substrates) <- "Identifier"
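The summaries below show that Identifier and Seq.Window were read in as factors, which was the default before R 4.0. To keep this behaviour stable across R versions, the column types can be pinned explicitly; a defensive variant of the import (not required by the original analysis):

```r
# Explicitly request factor columns so results match under both R < 4.0 and
# R >= 4.0, where the default for stringsAsFactors changed from TRUE to FALSE.
# read.delim is equivalent to read.csv with sep = "\t".
InsulinPhospho <- read.delim("./Data2/Datasets/InsulinPhospho.txt",
                             stringsAsFactors = TRUE)
```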

3. Data Exploration, Visualisation and Preparation

3.1 Summary, describe and dim functions.

# Summary of InsulinPhospho
summary(InsulinPhospho)
##                Identifier            Seq.Window       Avg.Fold       
##  100043914;88;      :    1   EEQSSASVKKDET:    3   Min.   :-3.57566  
##  100043914;89;      :    1   EMKNEKSEEEQSS:    3   1st Qu.:-0.10775  
##  1110018J18RIK;18;  :    1   GPEDPKSQVGPED:    3   Median : 0.05211  
##  1110037F02RIK;133; :    1   KSEEEQSSASVKK:    3   Mean   : 0.18380  
##  1110037F02RIK;138; :    1   _____MSETAPAA:    2   3rd Qu.: 0.31962  
##  1110037F02RIK;1628;:    1   AAAAGDSDSWDAD:    2   Max.   : 4.66251  
##  (Other)            :12056   (Other)      :12046                     
##       AUC              X15s               X30s          
##  Min.   :0.1267   Min.   :-6.43115   Min.   :-6.048357  
##  1st Qu.:0.4210   1st Qu.:-0.16701   1st Qu.:-0.162779  
##  Median :0.5118   Median : 0.01071   Median : 0.005617  
##  Mean   :0.5080   Mean   : 0.02670   Mean   : 0.030709  
##  3rd Qu.:0.5936   3rd Qu.: 0.20577   3rd Qu.: 0.190593  
##  Max.   :0.8994   Max.   : 8.09584   Max.   : 5.675916  
##                                                         
##       X1m                X2m                X5m          
##  Min.   :-4.23629   Min.   :-6.50937   Min.   :-5.35020  
##  1st Qu.:-0.16813   1st Qu.:-0.15529   1st Qu.:-0.14475  
##  Median : 0.01723   Median : 0.03757   Median : 0.04856  
##  Mean   : 0.05697   Mean   : 0.10266   Mean   : 0.23983  
##  3rd Qu.: 0.20631   3rd Qu.: 0.26007   3rd Qu.: 0.38010  
##  Max.   : 6.52259   Max.   : 5.92652   Max.   : 6.29936  
##                                                          
##       X10m               X20m               X60m         
##  Min.   :-4.48161   Min.   :-3.80870   Min.   :-4.99095  
##  1st Qu.:-0.16683   1st Qu.:-0.18933   1st Qu.:-0.20728  
##  Median : 0.06984   Median : 0.06777   Median : 0.08565  
##  Mean   : 0.32615   Mean   : 0.34238   Mean   : 0.34501  
##  3rd Qu.: 0.51104   3rd Qu.: 0.55507   3rd Qu.: 0.64820  
##  Max.   : 9.68638   Max.   : 7.82420   Max.   : 9.17293  
##                                                          
##      Ins.1               LY                Ins.2         
##  Min.   :-9.2126   Min.   :-5.931833   Min.   :-9.48170  
##  1st Qu.:-0.1994   1st Qu.:-0.295927   1st Qu.:-0.20931  
##  Median : 0.1000   Median :-0.004296   Median : 0.08076  
##  Mean   : 0.2108   Mean   : 0.039761   Mean   : 0.23615  
##  3rd Qu.: 0.4654   3rd Qu.: 0.294982   3rd Qu.: 0.50019  
##  Max.   : 5.6466   Max.   : 4.961114   Max.   : 7.41844  
##                                                          
##        MK          
##  Min.   :-5.09015  
##  1st Qu.:-0.15985  
##  Median : 0.03766  
##  Mean   : 0.19743  
##  3rd Qu.: 0.34691  
##  Max.   : 7.65471  
## 
describe(InsulinPhospho)
## InsulinPhospho 
## 
##  16  Variables      12062  Observations
## ---------------------------------------------------------------------------
## Identifier 
##        n  missing distinct 
##    12062        0    12062 
## 
## lowest : 100043914;88;      100043914;89;      1110018J18RIK;18;  1110037F02RIK;133; 1110037F02RIK;138;
## highest: ZNRF2;73;          ZNRF2;75;          ZW10;438;          ZYX;272;           ZYX;336;          
## ---------------------------------------------------------------------------
## Seq.Window 
##        n  missing distinct 
##    12062        0    11983 
## 
## lowest : _____MSAGGDFG _____MSAGSSCS _____MSASAGGS _____MSASSLLE _____MSDFDEFE
## highest: YYGRDRSPLRRNA YYLRKDSLSMGVS YYRGSVSLNSSPK YYRRTHSDASDDE YYTSPNSPTSSSR
## ---------------------------------------------------------------------------
## Avg.Fold 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12062        0    11985        1   0.1838    0.573 -0.45187 -0.28681 
##      .25      .50      .75      .90      .95 
## -0.10775  0.05211  0.31962  0.85661  1.36784 
## 
## lowest : -3.575662 -2.982594 -2.875546 -2.719204 -2.564780
## highest:  4.450629  4.505098  4.505757  4.510543  4.662506
## ---------------------------------------------------------------------------
## AUC 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12062        0    11985        1    0.508   0.1422   0.2949   0.3419 
##      .25      .50      .75      .90      .95 
##   0.4210   0.5118   0.5936   0.6712   0.7128 
## 
## lowest : 0.1267419 0.1354366 0.1387156 0.1533232 0.1561299
## highest: 0.8607463 0.8620492 0.8627747 0.8723514 0.8993557
## ---------------------------------------------------------------------------
## X15s 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12062        0    10850        1   0.0267   0.4982 -0.66341 -0.42668 
##      .25      .50      .75      .90      .95 
## -0.16701  0.01071  0.20577  0.50007  0.76590 
## 
## lowest : -6.431152 -5.707707 -4.708113 -4.679189 -4.351233
## highest:  5.861646  5.912577  6.243705  6.505302  8.095836
## ---------------------------------------------------------------------------
## X30s 
##         n   missing  distinct      Info      Mean       Gmd       .05 
##     12062         0     10701         1   0.03071    0.4819 -0.642325 
##       .10       .25       .50       .75       .90       .95 
## -0.398556 -0.162779  0.005617  0.190593  0.464509  0.753963 
## 
## lowest : -6.048357 -4.580469 -4.344057 -4.257939 -4.008610
## highest:  4.820007  4.864614  5.008597  5.385148  5.675916
## ---------------------------------------------------------------------------
## X1m 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12062        0    10821        1  0.05697   0.5354 -0.67915 -0.42421 
##      .25      .50      .75      .90      .95 
## -0.16813  0.01723  0.20631  0.53875  0.86327 
## 
## lowest : -4.236292 -3.789843 -3.746004 -3.735446 -3.584690
## highest:  4.941939  4.967731  5.074889  5.315292  6.522594
## ---------------------------------------------------------------------------
## X2m 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12062        0    10931        1   0.1027   0.5753 -0.62656 -0.41236 
##      .25      .50      .75      .90      .95 
## -0.15529  0.03757  0.26007  0.62645  1.03597 
## 
## lowest : -6.509371 -4.483367 -3.834470 -3.639205 -3.570564
## highest:  5.113640  5.262066  5.840508  5.884139  5.926518
## ---------------------------------------------------------------------------
## X5m 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12062        0    10937        1   0.2398   0.7696 -0.60440 -0.36966 
##      .25      .50      .75      .90      .95 
## -0.14475  0.04856  0.38010  1.17006  1.94407 
## 
## lowest : -5.350203 -4.232269 -4.044319 -3.620915 -3.302839
## highest:  4.917470  5.133200  5.157161  5.238157  6.299364
## ---------------------------------------------------------------------------
## X10m 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12062        0    10941        1   0.3261   0.9349 -0.66801 -0.41619 
##      .25      .50      .75      .90      .95 
## -0.16683  0.06984  0.51104  1.53571  2.46632 
## 
## lowest : -4.481613 -3.594390 -3.531200 -3.505036 -3.184994
## highest:  5.552755  5.618058  6.025959  6.273357  9.686378
## ---------------------------------------------------------------------------
## X20m 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12062        0    10974        1   0.3424    1.012 -0.73443 -0.48258 
##      .25      .50      .75      .90      .95 
## -0.18933  0.06777  0.55507  1.72995  2.62390 
## 
## lowest : -3.808696 -3.764977 -3.680030 -3.664272 -3.410112
## highest:  5.881314  6.059930  6.373999  6.867712  7.824198
## ---------------------------------------------------------------------------
## X60m 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12062        0    11095        1    0.345    1.084 -0.93517 -0.57109 
##      .25      .50      .75      .90      .95 
## -0.20728  0.08565  0.64820  1.82106  2.59184 
## 
## lowest : -4.990950 -4.282663 -4.127868 -3.973265 -3.874971
## highest:  6.081722  6.281500  7.917290  9.063885  9.172934
## ---------------------------------------------------------------------------
## Ins.1 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12062        0    10715        1   0.2108   0.9678  -1.1473  -0.6703 
##      .25      .50      .75      .90      .95 
##  -0.1994   0.1000   0.4654   1.4370   2.0989 
## 
## lowest : -9.212608 -9.069396 -8.781314 -8.731222 -7.243537
## highest:  4.602209  4.663631  4.703433  5.477451  5.646595
## ---------------------------------------------------------------------------
## LY 
##         n   missing  distinct      Info      Mean       Gmd       .05 
##     12062         0     10745         1   0.03976    0.7696 -1.076422 
##       .10       .25       .50       .75       .90       .95 
## -0.720520 -0.295927 -0.004296  0.294982  0.897364  1.428507 
## 
## lowest : -5.931833 -5.775406 -4.270598 -4.156807 -4.122705
## highest:  4.095418  4.101230  4.183724  4.305606  4.961114
## ---------------------------------------------------------------------------
## Ins.2 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12062        0    10857        1   0.2362    1.016 -1.15844 -0.66518 
##      .25      .50      .75      .90      .95 
## -0.20931  0.08076  0.50019  1.60192  2.27527 
## 
## lowest : -9.481698 -8.560064 -5.663465 -5.433093 -5.309860
## highest:  4.767779  4.818722  4.848097  5.133769  7.418443
## ---------------------------------------------------------------------------
## MK 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12062        0    10706        1   0.1974   0.7927 -0.75905 -0.47456 
##      .25      .50      .75      .90      .95 
## -0.15985  0.03766  0.34691  1.19754  1.97925 
## 
## lowest : -5.090152 -4.707793 -4.466225 -4.301273 -4.295086
## highest:  4.421762  4.516157  4.576791  4.845139  7.654708
## ---------------------------------------------------------------------------
dim(InsulinPhospho)
## [1] 12062    16
# Summary of AUC_Ins
summary(AUC_Ins)
##                Identifier            Seq.Window         AUC        
##  100043914;88;      :    1   EEQSSASVKKDET:    3   Min.   :0.1267  
##  100043914;89;      :    1   EMKNEKSEEEQSS:    3   1st Qu.:0.4210  
##  1110018J18RIK;18;  :    1   GPEDPKSQVGPED:    3   Median :0.5118  
##  1110037F02RIK;133; :    1   KSEEEQSSASVKK:    3   Mean   :0.5080  
##  1110037F02RIK;138; :    1   _____MSETAPAA:    2   3rd Qu.:0.5936  
##  1110037F02RIK;1628;:    1   AAAAGDSDSWDAD:    2   Max.   :0.8994  
##  (Other)            :12056   (Other)      :12046
describe(AUC_Ins)
## AUC_Ins 
## 
##  3  Variables      12062  Observations
## ---------------------------------------------------------------------------
## Identifier 
##        n  missing distinct 
##    12062        0    12062 
## 
## lowest : 100043914;88;      100043914;89;      1110018J18RIK;18;  1110037F02RIK;133; 1110037F02RIK;138;
## highest: ZNRF2;73;          ZNRF2;75;          ZW10;438;          ZYX;272;           ZYX;336;          
## ---------------------------------------------------------------------------
## Seq.Window 
##        n  missing distinct 
##    12062        0    11983 
## 
## lowest : _____MSAGGDFG _____MSAGSSCS _____MSASAGGS _____MSASSLLE _____MSDFDEFE
## highest: YYGRDRSPLRRNA YYLRKDSLSMGVS YYRGSVSLNSSPK YYRRTHSDASDDE YYTSPNSPTSSSR
## ---------------------------------------------------------------------------
## AUC 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12062        0    11985        1    0.508   0.1422   0.2949   0.3419 
##      .25      .50      .75      .90      .95 
##   0.4210   0.5118   0.5936   0.6712   0.7128 
## 
## lowest : 0.1267419 0.1354366 0.1387156 0.1533232 0.1561299
## highest: 0.8607463 0.8620492 0.8627747 0.8723514 0.8993557
## ---------------------------------------------------------------------------
dim(AUC_Ins)
## [1] 12062     3
# Summary of Akt_Substrates
summary(Akt_Substrates)
##        Identifier
##  AKT1S1;318;: 1  
##  AS160;324; : 1  
##  AS160;595; : 1  
##  AS160;649; : 1  
##  BAD;136;   : 1  
##  BRF1;92;   : 1  
##  (Other)    :16
describe(Akt_Substrates)
## Akt_Substrates 
## 
##  1  Variables      22  Observations
## ---------------------------------------------------------------------------
## Identifier 
##        n  missing distinct 
##       22        0       22 
## 
## lowest : AKT1S1;318; AS160;324;  AS160;595;  AS160;649;  BAD;136;   
## highest: PFKFB2;486; RANBP3;58;  TSC2;1466;  TSC2;939;   TSC2;981;  
## ---------------------------------------------------------------------------
dim(Akt_Substrates)
## [1] 22  1
# Summary of mTOR_Substrates
summary(mTOR_Substrates)
##          Identifier
##  AHNAK;5536;  : 1  
##  AKT1S1;255;  : 1  
##  ANKRD17;2041;: 1  
##  APRF;727;    : 1  
##  DEPDC6;265;  : 1  
##  DEPDC6;293;  : 1  
##  (Other)      :20
describe(mTOR_Substrates)
## mTOR_Substrates 
## 
##  1  Variables      26  Observations
## ---------------------------------------------------------------------------
## Identifier 
##        n  missing distinct 
##       26        0       26 
## 
## lowest : AHNAK;5536;   AKT1S1;255;   ANKRD17;2041; APRF;727;     DEPDC6;265;  
## highest: RPS6KB1;427;  RPS6KB1;444;  RPS6KB1;452;  TBC1D10B;693; ULK1;763;    
## ---------------------------------------------------------------------------
dim(mTOR_Substrates)
## [1] 26  1

3.2 Examination of AUC_Ins

Both temporal phosphoproteomics dataframes contain the same number of rows (samples) but different numbers of columns (features). Check whether the secondary temporal phosphoproteomics dataframe (AUC_Ins) contains any data not already in InsulinPhospho.

test <- AUC_Ins %>% merge(InsulinPhospho, by="Identifier") # Merge the dataframes on their unique Identifier.

# Keep only rows where the AUC_Ins fields match the corresponding InsulinPhospho fields.
test.result <- test %>% dplyr::select(Identifier, Seq.Window.x, Seq.Window.y, AUC.x, AUC.y) %>% filter(Seq.Window.x == Seq.Window.y & AUC.x == AUC.y)

head(test.result)
##            Identifier  Seq.Window.x  Seq.Window.y     AUC.x     AUC.y
## 1       100043914;88; GRHERRSSAEQHS GRHERRSSAEQHS 0.2967288 0.2967288
## 2       100043914;89; RHERRSSAEQHSE RHERRSSAEQHSE 0.5279128 0.5279128
## 3   1110018J18RIK;18; CAGRVPSPAGSVT CAGRVPSPAGSVT 0.5240430 0.5240430
## 4  1110037F02RIK;133; SVDRVISHDRDSP SVDRVISHDRDSP 0.6480379 0.6480379
## 5  1110037F02RIK;138; ISHDRDSPPPPPP ISHDRDSPPPPPP 0.5002719 0.5002719
## 6 1110037F02RIK;1628; FLSEPSSPGRSKT FLSEPSSPGRSKT 0.4370191 0.4370191
write.csv(test.result, file = "./Output/AUC_Ins_test_result.csv")

dim(test.result) # test.result has 12,062 rows (the same as InsulinPhospho), so every AUC_Ins row matches its InsulinPhospho counterpart. AUC_Ins contains no new data.
## [1] 12062     5
rm(test, test.result) # Drop the intermediate data frames; they are no longer needed.

# There is no need to consider data from AUC_Ins because it is contained within InsulinPhospho.
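The same containment check can also be done without a merge; a minimal sketch using dplyr's anti_join on the shared columns (column names as shown in the summaries above):

```r
# Rows of AUC_Ins with no exact match in InsulinPhospho across the shared columns.
# An empty result means AUC_Ins is fully contained within InsulinPhospho.
leftover <- dplyr::anti_join(AUC_Ins,
                             InsulinPhospho[, c("Identifier", "Seq.Window", "AUC")],
                             by = c("Identifier", "Seq.Window", "AUC"))
nrow(leftover) # 0 if AUC_Ins adds no new data
```

Note that joining on the numeric AUC column relies on the values being bitwise identical, which holds here since both files were exported from the same source.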

3.3 Create a combined master data set

Identifiers of known Akt and mTOR substrates are provided in the Akt_substrates and mTOR_substrates datasets. Append indicator fields for these known substrates to the InsulinPhospho columns to create a master data set.

dat <- InsulinPhospho # Copy InsulinPhospho as the master data set.

# Add Akt Substrates to dat
dat$aktSubstrate <- 0 # Add new column of zeros to dat for aktSubstrate
dat$aktSubstrate[dat$Identifier %in% Akt_Substrates$Identifier] <- 1 # Label rows with known AktSubstrates with a 1.
dat$aktSubstrate <- as.factor(dat$aktSubstrate) # Change labels to factors.
dat %>% filter(aktSubstrate == 1) # There should be 22 Akt Substrates. Check to make sure this is the case.
##       Identifier    Seq.Window  Avg.Fold       AUC        X15s      X30s
## 1    AKT1S1;318; PRPRLNTSDFQKL 3.5063686 0.2030335  1.79324937 2.9496267
## 2     AS160;324; FRSRCSSVTGVMQ 2.6444937 0.2217749  1.09004613 2.4160883
## 3     AS160;595; MRGRLGSMDSFER 3.5000308 0.2449809  1.14603323 2.7169587
## 4     AS160;649; FRRRAHTFSHPPS 2.7125564 0.3436261  1.86233676 2.5158617
## 5       BAD;136; FRGRSRSAPPNLW 1.1134673 0.3063216  0.26743636 0.6608665
## 6       BRF1;92; FRDRSFSEGGERL 2.0512883 0.3971139  0.31345060 1.6545300
## 7    CARHSP1;53; RRTRTFSATVRAS 0.6102895 0.3393443  0.10286623 0.4214935
## 8    ECNOS;1176; SRIRTQSFSLQER 3.3036490 0.3120402  1.95858851 2.9400282
## 9      EDC3;161; FRRRHNSWSSSSR 0.7890872 0.5122196  0.17997063 0.3818158
## 10    FKHR2;252; PRRRAVSMDNSNK 3.3713432 0.2581184  1.79702760 2.7233798
## 11    FOXO1;316; FRPRTSSNASTIS 2.0303375 0.3183012  1.36537800 2.1674276
## 12     GSK3A;21; GRARTSSFAEPGG 1.8109899 0.2360954  1.05994278 1.7976242
## 13      GSK3B;9; GRPRTTSFAESCK 2.2435031 0.2317606  2.07312529 2.0480453
## 14     IRS1;265; FRPRSKSQSSSSC 2.4203834 0.1874690  1.80474628 2.2272248
## 15     IRS1;522; FRKRTHSAGTSPT 3.5017097 0.2202295  3.18479801 3.7653269
## 16 KIAA1272;715; MRFRSATTSGAPG 3.6315602 0.2045568  2.90029590 3.9743589
## 17   PFKFB2;469; VRMRRNSFTPLSS 2.1497783 0.2405669 -0.61652987 2.3467934
## 18   PFKFB2;486; RRPRNYSVGSRPL 2.4928085 0.3131412  1.04467800 1.9926449
## 19    RANBP3;58; KRERTSSLTHSEE 1.4981902 0.4456912  0.02024329 0.6831158
## 20    TSC2;1466; LRPRGYTISDSAP 2.2595063 0.3103911  1.09374609 2.2927237
## 21     TSC2;939; FRARSTSLNERPK 1.7348112 0.2579600  0.85456623 1.6825158
## 22     TSC2;981; FRCRSISVSEHVV 2.0218485 0.2256770  0.64320312 1.7322447
##          X1m       X2m       X5m      X10m      X20m      X60m     Ins.1
## 1  3.6929821 3.6410960 4.0850147 3.8988759 3.9781234 4.0119808 2.4435006
## 2  2.7808397 2.8150416 3.1520775 2.9641961 2.8741745 3.0634861 2.5618882
## 3  3.8662323 3.8460953 4.1608367 4.3124341 4.0466843 3.9049719 3.0435020
## 4  3.8968884 3.3103308 2.7608542 3.2505217 1.6278092 2.4758485 1.9556848
## 5  1.0895884 1.3932881 1.1977503 1.4687109 1.4838691 1.3462286 0.8771160
## 6  1.8278517 2.3240815 2.0455537 2.1632451 2.9987398 3.0828541 1.5203971
## 7  0.6383167 0.6397195 0.7636132 0.8609022 0.7909329 0.6644715 0.5115692
## 8  3.5337725 4.3957991 4.5197697 2.9879681 2.9856010 3.1076649 2.7818330
## 9  0.6641306 0.7669697 0.8166649 1.1021829 1.5026015 0.8983615 0.8567865
## 10 3.7485577 4.2097757 3.6186715 3.4755719 3.4267511 3.9710105 2.8694427
## 11 2.4548910 2.8385740 2.4755004 1.6673830 1.7489870 1.5245591 1.5571069
## 12 2.0253917 2.2009603 1.9187483 1.8104378 1.6001571 2.0746572 1.8268434
## 13 2.5860696 2.7226190 2.2595069 2.1286849 1.6998950 2.4300785 1.6249572
## 14 2.2860361 2.3195207 2.5772457 2.7773139 2.7513025 2.6196770 2.0322772
## 15 4.2802085 3.8039699 3.2472222 3.6922333 3.4138487 2.6260700 2.6677332
## 16 3.7084127 4.2649214 3.5941510 2.9888252 3.7965945 3.8249220 3.0211066
## 17 2.0548841 2.6590394 2.7341940 2.6148573 2.8127280 2.5922604 2.1981597
## 18 2.5012813 2.5750736 2.7087517 2.7991921 3.3598561 2.9609903 2.1681606
## 19 1.2336567 1.8135842 2.0102513 2.4843846 1.8031033 1.9371825 1.3780121
## 20 1.8103598 2.4672887 2.3265719 3.0254502 2.2898481 2.7700618 2.2456478
## 21 1.9670307 1.7871966 1.6364653 1.9876280 1.8068095 2.1562772 1.7609448
## 22 2.2642534 2.4280360 2.3883727 2.3669369 2.0835106 2.2682308 1.7042523
##             LY     Ins.2          MK aktSubstrate
## 1   0.90188056 2.6152986 -0.84959630            1
## 2   0.35895883 2.6616383 -0.60092737            1
## 3   2.08987098 4.2587563  1.99019248            1
## 4   1.35371850 3.2326608 -0.77850472            1
## 5   0.41565051 1.0552651 -0.27522751            1
## 6   0.64247054 2.7904801  1.29519404            1
## 7  -0.09615960 0.5439719  0.35782560            1
## 8  -0.02538751 2.5012608  0.63932479            1
## 9   0.03040671 0.6190486  0.33319472            1
## 10  1.27535861 3.2895524  0.06694804            1
## 11  0.60681582 1.9505807  0.69546036            1
## 12  0.76553475 1.9555741  0.74038355            1
## 13  0.72180764 1.8650874  1.20263838            1
## 14  0.81532949 2.1558464  1.47421174            1
## 15  2.06767313 3.5093158  2.23333590            1
## 16  1.05580714 1.5761317 -1.17459827            1
## 17  1.27485624 0.7385128  1.58090204            1
## 18  1.22817222 2.1286233  1.56554502            1
## 19  0.70531444 1.5717253  1.21182212            1
## 20  1.03393391 2.5180929 -0.49254328            1
## 21  0.75034924 1.4731697 -0.80007329            1
## 22  0.60131618 1.9982121  0.19181529            1
# Repeating the same process as above for mTOR substrates:
dat$mTORSubstrate <- 0
dat$mTORSubstrate[dat$Identifier %in% mTOR_Substrates$Identifier] <- 1
dat %>% filter(mTORSubstrate == 1) # There should be 26 mTOR substrates. Check to make sure this is the case.
##       Identifier    Seq.Window    Avg.Fold       AUC         X15s
## 1    AHNAK;5536; VEAEASSPKGKFS  0.42633446 0.6567462 -0.176054689
## 2    AKT1S1;255; TQQYAKSLPVSVP  0.98999183 0.4360069 -0.007265314
## 3  ANKRD17;2041; PVSSPSSPSPPAQ  0.64920432 0.5362401  0.009309093
## 4      APRF;727; TIDLPMSPRTLDS  0.55215123 0.5233701 -0.426227432
## 5    DEPDC6;265; TSFMSVSPSKEIK  0.51836151 0.6913704  0.019358169
## 6    DEPDC6;293; SGYFSSSPTLSSS  0.69685396 0.6831728  0.541223454
## 7    DEPDC6;299; SPTLSSSPPVLCN  0.40443692 0.4907117  0.432595528
## 8   EIF4EBP1;36; PGDYSTTPGGTLF  0.13496319 0.4781614  0.053892242
## 9   EIF4EBP1;64; LMECRNSPVAKTP  0.71742741 0.5722157 -0.065138500
## 10  EIF4EBP2;37; PQDYCTTPGGTLF  0.43604513 0.5058809 -0.182937288
## 11  EIF4EBP2;46; GTLFSTTPGGTRI  0.17340754 0.3716139  0.232247973
## 12    FRAP;2481; VPESIHSFIGDGL  0.56419310 0.4557192  0.157769811
## 13    GRB10;503; NILSSQSPLHPST  0.81454942 0.5174816 -0.751905100
## 14     IRS1;632; GDYMPMSPKSVSA  1.49906522 0.5938386 -0.030464601
## 15      MAF1;60; HVLEALSPPQTSG  1.19491669 0.5144249 -0.670922080
## 16      MAF1;68; PQTSGLSPSRLSK  1.00365258 0.5403878 -0.555997216
## 17   NUMA1;1982; THQGPGTPESKKA  0.83731208 0.6926504 -0.587028000
## 18    PATL1;184; TSPIIGSPPVRAV  1.03362489 0.5261557  0.128700234
## 19    PATL1;194; RAVPIGTPPKQMA  1.00472797 0.5992069  0.436770044
## 20   RAPTOR;863; TQSAPASPTNKGM  0.84801251 0.5054830  0.120069906
## 21   RAPTOR;877; MHQVGGSPPASST -0.00532214 0.6726739 -0.191369510
## 22  RPS6KB1;427; SVKEKFSFEPKIR  2.23146956 0.5458955  0.084781349
## 23  RPS6KB1;444; FIGSPRTPVSPVK  1.82587943 0.5360684  0.590065853
## 24  RPS6KB1;452; VSPVKFSPGDFWG  2.23331222 0.5497894  2.291144987
## 25 TBC1D10B;693; LHPSLPSPTGNST  0.80164891 0.5476402  0.144077311
## 26     ULK1;763; VVFTVGSPPSGAT  0.72853856 0.7018710  0.533573344
##            X30s          X1m         X2m         X5m      X10m       X20m
## 1  -0.217390655 -0.334683918 -0.44519263  0.25533667 1.0464213  1.5109363
## 2   0.160027912  0.547349427  1.07587947  1.50442263 1.5013697  1.5893041
## 3  -0.015606857  0.207165706  0.34081470  1.00575082 1.2573407  1.1973539
## 4  -0.964410775 -0.194466781 -0.24003503  1.39900189 1.4489483  2.0398072
## 5  -0.156930912 -0.134111736  0.33140181  0.57627243 0.7510608  1.0707849
## 6   0.067775694  0.409180800  0.53987771  1.02175763 2.0597414 -0.0228508
## 7  -0.070780177  0.186650146  0.23696274  1.20723293 0.3657260  1.8393274
## 8  -0.090965206 -0.008004236  0.16407134  0.17895159 0.2498839  0.2264721
## 9  -0.044541666  0.111832061  0.30066749  1.08497803 1.5636114  1.4150152
## 10  0.191076439  1.019805586  0.50131281  0.46518926 0.5337301  0.5652371
## 11  0.082650222 -0.208825550  0.47783821 -1.17673824 0.5174982  0.5786523
## 12  0.061129258 -0.142141769  0.41436027  1.01980520 1.0373473  0.9354886
## 13 -0.118098662 -0.001528758  0.06968892  1.36414164 1.6279706  2.2219585
## 14 -0.053877677  0.009467811  0.72018799  2.30840705 2.8996614  3.3386457
## 15 -0.564868134  0.178840508  0.62393093  2.20634190 2.6196297  2.8769290
## 16 -0.200317130  0.256986250  1.17604053  1.22469445 1.9409463  1.6891156
## 17 -0.643446989 -0.491875735 -0.76117779 -0.08562473 1.9657848  3.7091163
## 18  0.010209789 -0.182544297  0.50174870  1.93246904 1.8129093  2.1285754
## 19  0.016577667  0.088870846  0.37768135  0.89375770 2.1781633  1.9382400
## 20  0.075364731  0.323780019  0.55571219  1.40396490 1.5468006  1.4289157
## 21 -0.202757468 -0.099480367 -0.05867325 -0.12431169 0.1100442  0.1877505
## 22  0.512462410  0.177735972  0.68961107  3.59354117 4.2936110  4.1805202
## 23  0.566495011  0.411982801  0.87967942  3.40790889 2.4611260  3.5691473
## 24  0.397956875  0.243112184  0.25134133  2.25508225 3.8001843  4.2717838
## 25 -0.684524590 -0.218289064 -0.03673607  1.26383575 1.8261929  1.8335039
## 26 -0.008488011 -0.263439870 -0.43188586  0.38466278 0.6540405  2.0991027
##          X60m      Ins.1          LY      Ins.2          MK aktSubstrate
## 1   1.7713032  1.7586017  1.22391675  2.1768015  1.72792046            0
## 2   1.5488467  1.2284801 -1.48212971  1.3697177 -0.34351278            0
## 3   1.1915065  1.0306892  0.19244669  0.3847130  0.33529916            0
## 4   1.3545924  1.8353582  1.11962178  1.9947969  2.10989541            0
## 5   1.6890566  1.1304697 -0.32769234  0.9984844 -0.06444973            0
## 6   0.9581258  2.6999513 -2.04114990  1.0310333  0.94417969            0
## 7  -0.9622192 -0.1700383  0.17638227 -0.9213689 -0.77147990            0
## 8   0.3054039  0.2788174 -2.44748521  0.4462754 -0.21536788            0
## 9   1.3729952  0.8096614 -1.21943792  1.0433781  0.48140228            0
## 10  0.3949470  0.8233308 -3.25529820  0.6229304  0.28226187            0
## 11  0.8839371  0.8063951 -2.80701057 -0.2063343 -0.40697727            0
## 12  1.0297861  0.7931051 -1.64855258  2.1314578 -0.28510830            0
## 13  2.1041683  1.9955565 -1.98625969  1.8308322  0.57329966            0
## 14  2.8004941  2.6139795  1.59579007  2.9494317  2.68473505            0
## 15  2.2894517  2.6095269 -0.84347478  2.3496141 -0.63037407            0
## 16  2.4977519  2.5119487 -1.26711434  2.2076210  0.49737663            0
## 17  3.5927488  0.1467855  0.03266511  3.7977174  2.91975080            0
## 18  1.9369309  1.8985180 -2.59797642  1.7865717 -0.43711530            0
## 19  2.1077629  1.7193617  0.59199044  2.2427720  0.13333731            0
## 20  1.3294921  1.2082051  0.49251979  1.1405824  1.04180369            0
## 21  0.3362205  0.7923556 -0.36974365  0.2824731  0.05479087            0
## 22  4.3194933  3.0319245 -5.77540647  3.7017708  0.92052198            0
## 23  2.7206302  1.1003722  3.11909052  2.3925158  4.26581642            0
## 24  4.3558920  3.7749452 -2.88683294  3.1002713  0.15317365            0
## 25  2.2851311  0.9178701  1.34152331  1.6354072  1.40732826            0
## 26  2.8607429  1.9964985 -1.27744990  1.7056683 -0.44742619            0
##    mTORSubstrate
## 1              1
## 2              1
## 3              1
## 4              1
## 5              1
## 6              1
## 7              1
## 8              1
## 9              1
## 10             1
## 11             1
## 12             1
## 13             1
## 14             1
## 15             1
## 16             1
## 17             1
## 18             1
## 19             1
## 20             1
## 21             1
## 22             1
## 23             1
## 24             1
## 25             1
## 26             1
dat$mTORSubstrate <- as.factor(dat$mTORSubstrate)

# Check for NAs in the master dataset, to identify whether imputation or field/record removal is required
sum(is.na(dat)) # No NA's in dataset. No imputation or data removal required.
## [1] 0

3.4 Visualisation

A series of visualisations was created to better understand the data:
a) Visualise the data to understand relationships between the features and to identify potential new features.
b) Visualise the log2 fold-change time series of the insulin-stimulated phosphorylation profiles.
c) Scatter plots
d) Violin plots
e) Kernel density plots

# Create a dataframe of the time series data, renaming the time columns to allow for ordering.
dat.timeseries <- dat %>% dplyr::rename(
  "1" = "X15s", "2" = "X30s", "3" = "X1m", "4" = "X2m",
  "5" = "X5m", "6" = "X10m", "7" = "X20m", "8" = "X60m"
)

# Use reshape2::melt to stack all the time-series data into a single column for plotting. Order dataframe by Identifier.
library(reshape2)
my.result <- melt(dat.timeseries, id = c("Identifier","Seq.Window","Avg.Fold", "AUC","Ins.1","LY","Ins.2","MK","aktSubstrate","mTORSubstrate")) 
my.result <- my.result[order(my.result$Identifier),]

# The time-series profiles start at different y values. Adjust all y values to start at zero. This allows for an "apples to apples comparison" of the time-series.
my.result.var1 <- my.result %>% filter(variable == 1) %>% select(Identifier, value)
my.result2 <- my.result %>% merge(my.result.var1, by="Identifier", all.x=TRUE) %>% mutate(value.offset = value.x - value.y)
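The merge-based offset works; an equivalent, merge-free approach uses a grouped mutate. A minimal sketch on toy data (`toy` and its values are illustrative, not from the report):

```r
library(dplyr)

# Toy stand-in for the melted time series: two identifiers, three time points each
toy <- data.frame(
  Identifier = rep(c("a", "b"), each = 3),
  variable   = rep(1:3, times = 2),
  value      = c(0.5, 1.5, 2.0, -1.0, 0.0, 1.0)
)

# Subtract each profile's first time point so every curve starts at zero
toy.offset <- toy %>%
  group_by(Identifier) %>%
  mutate(value.offset = value - value[variable == 1]) %>%
  ungroup()
```

This avoids materialising the intermediate lookup table that the merge approach needs.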

3.4.1 Time-series response curves for Akt and mTOR

Plot the time-series response curves for Akt and mTOR to examine differences in their profiles.

ggplot(my.result2 %>% filter(aktSubstrate==1 | mTORSubstrate == 1))+
  geom_line(aes(x=variable, y=value.offset, colour = aktSubstrate, group=Identifier))

A notable difference can be seen between the Akt (1) and mTOR (0) response curves from time point 1 (15 s) to time point 6 (10 min). Beyond time point 6 both response curves appear to have a similarly flat/negative response.

3.4.2 Replot Time-series response curves using mean of Akt and mTOR

Re-plot the above graph taking the mean of Akt and mTOR. Include error bars of 1 standard deviation.

# Use group-by to combine Akt and mTOR samples at each time interval and calculate the mean and standard deviation.
my.result2.grouped <- my.result2 %>% group_by(aktSubstrate, mTORSubstrate, variable) %>% dplyr::summarize(value.offset.mn = mean(value.offset), value.offset.var = var(value.offset)) %>% data.frame %>% mutate(value.offset.sd = sqrt(value.offset.var))

# Filter for rows containing known Akt and mTOR substrates (i.e. remove all unknown rows)
my.result2.grouped
##    aktSubstrate mTORSubstrate variable value.offset.mn value.offset.var
## 1             0             0        1     0.000000000        0.0000000
## 2             0             0        2     0.002683066        0.2315073
## 3             0             0        3     0.027998540        0.3605401
## 4             0             0        4     0.072980073        0.4372112
## 5             0             0        5     0.209028927        0.7272481
## 6             0             0        6     0.294750525        0.9976316
## 7             0             0        7     0.310681932        1.0937986
## 8             0             0        8     0.313309799        1.1477497
## 9             0             1        1     0.000000000        0.0000000
## 10            0             1        2    -0.154828787        0.2352914
## 11            0             1        3    -0.009111597        0.3806842
## 12            0             1        4     0.235570723        0.5077636
## 13            0             1        5     1.093867783        1.1842623
## 14            0             1        6     1.536133636        1.0623229
## 15            0             1        7     1.780329361        1.4661913
## 16            0             1        8     1.728650862        1.5166222
## 17            1             0        1     0.000000000        0.0000000
## 18            1             0        2     0.915977114        0.3683177
## 19            1             0        3     1.316928956        0.4384971
## 20            1             0        4     1.512899211        0.5600328
## 21            1             0        5     1.411754051        0.7790154
## 22            1             0        6     1.404033533        0.7688073
## 23            1             0        7     1.315578573        0.9530284
## 24            1             0        8     1.380574848        0.8755800
##    value.offset.sd
## 1        0.0000000
## 2        0.4811520
## 3        0.6004499
## 4        0.6612195
## 5        0.8527884
## 6        0.9988151
## 7        1.0458483
## 8        1.0713308
## 9        0.0000000
## 10       0.4850684
## 11       0.6169961
## 12       0.7125753
## 13       1.0882382
## 14       1.0306905
## 15       1.2108639
## 16       1.2315122
## 17       0.0000000
## 18       0.6068918
## 19       0.6621911
## 20       0.7483534
## 21       0.8826185
## 22       0.8768166
## 23       0.9762317
## 24       0.9357243
my.result2.grouped %>% filter(aktSubstrate == 1 | mTORSubstrate == 1)
##    aktSubstrate mTORSubstrate variable value.offset.mn value.offset.var
## 1             0             1        1     0.000000000        0.0000000
## 2             0             1        2    -0.154828787        0.2352914
## 3             0             1        3    -0.009111597        0.3806842
## 4             0             1        4     0.235570723        0.5077636
## 5             0             1        5     1.093867783        1.1842623
## 6             0             1        6     1.536133636        1.0623229
## 7             0             1        7     1.780329361        1.4661913
## 8             0             1        8     1.728650862        1.5166222
## 9             1             0        1     0.000000000        0.0000000
## 10            1             0        2     0.915977114        0.3683177
## 11            1             0        3     1.316928956        0.4384971
## 12            1             0        4     1.512899211        0.5600328
## 13            1             0        5     1.411754051        0.7790154
## 14            1             0        6     1.404033533        0.7688073
## 15            1             0        7     1.315578573        0.9530284
## 16            1             0        8     1.380574848        0.8755800
##    value.offset.sd
## 1        0.0000000
## 2        0.4850684
## 3        0.6169961
## 4        0.7125753
## 5        1.0882382
## 6        1.0306905
## 7        1.2108639
## 8        1.2315122
## 9        0.0000000
## 10       0.6068918
## 11       0.6621911
## 12       0.7483534
## 13       0.8826185
## 14       0.8768166
## 15       0.9762317
## 16       0.9357243
# Plot mean response including 1 standard deviation bar.
ggplot(my.result2.grouped %>% filter(aktSubstrate==1 | mTORSubstrate == 1))+
  geom_point(aes(x=variable, y=value.offset.mn, colour = aktSubstrate, group=aktSubstrate)) +
  geom_line(aes(x=variable, y=value.offset.mn, colour = aktSubstrate, group=aktSubstrate)) +
  geom_errorbar(width=.4, aes(x=variable,ymin=value.offset.mn-value.offset.sd, ymax=value.offset.mn+value.offset.sd,colour=aktSubstrate))

The first four time segments (1 - 4) show distinctly different responses. Beyond time segment 4 the responses between Akt and mTOR are less distinct.

3.4.3 Scatterplots

Use scatter plots to investigate whether the Ins.1, LY, Ins.2 and MK responses exhibit any noticeable differences between Akt and mTOR that may help differentiate between them.

# features are not time based. Use master dataframe "dat"

cat.dat <- dat %>% mutate(Type = case_when(aktSubstrate==1~"Akt", mTORSubstrate == 1~"mTOR", TRUE~"Other"))  %>% select(Identifier, Seq.Window, Avg.Fold, AUC, Ins.1, LY, Ins.2, MK, aktSubstrate, mTORSubstrate, Type)

colnames(cat.dat)
##  [1] "Identifier"    "Seq.Window"    "Avg.Fold"      "AUC"          
##  [5] "Ins.1"         "LY"            "Ins.2"         "MK"           
##  [9] "aktSubstrate"  "mTORSubstrate" "Type"
head(cat.dat)
##            Identifier    Seq.Window   Avg.Fold       AUC      Ins.1
## 1       100043914;88; GRHERRSSAEQHS  1.0315122 0.2967288  0.6526217
## 2       100043914;89; RHERRSSAEQHSE  1.2160532 0.5279128  0.6112755
## 3   1110018J18RIK;18; CAGRVPSPAGSVT  0.2860675 0.5240430 -0.1988757
## 4  1110037F02RIK;133; SVDRVISHDRDSP -0.4100882 0.6480379  0.1735112
## 5  1110037F02RIK;138; ISHDRDSPPPPPP -0.3664219 0.5002719  0.2292187
## 6 1110037F02RIK;1628; FLSEPSSPGRSKT -0.2148644 0.4370191  0.2987756
##            LY      Ins.2          MK aktSubstrate mTORSubstrate  Type
## 1  0.34193177  2.1570012  2.13792056            0             0 Other
## 2  0.04620434  3.0963462  3.22683218            0             0 Other
## 3 -0.14777344 -0.2610874 -0.02748394            0             0 Other
## 4 -0.57074291  0.3012657 -0.13305247            0             0 Other
## 5 -0.38138545  0.4130244 -0.13305247            0             0 Other
## 6 -0.45838604  0.1086583  0.02536763            0             0 Other

Use cat.dat to compare the feature distributions across the substrate categories.

Scatter plot of Ins.1, LY

Draw scatter plots of Ins.1 and LY to see if the features exhibit any noticeable difference between Akt and mTOR that may help differentiate between them.

sp <- ggplot(cat.dat) +
  geom_point(aes(x=Ins.1, y=LY, color=Type)) +
  xlab("Ins.1") +
  ylab("LY") +
  theme_light()

sp

Scatter plot of Ins.2, MK

Draw scatter plots of Ins.2 and MK to see if the features exhibit any noticeable difference between Akt and mTOR that may help differentiate between them.

sp <- ggplot(cat.dat) +
  geom_point(aes(x=Ins.2, y=MK, color=Type)) +
  xlab("Ins.2") +
  ylab("MK") +
  theme_light()

sp

Scatter plot of Ins.2, Ins.1

sp <- ggplot(cat.dat) +
  geom_point(aes(x=Ins.2, y=Ins.1, color=Type)) +
  xlab("Ins.2") +
  ylab("Ins.1") +
  theme_light()

sp

Scatter plot of LY, MK

sp <- ggplot(cat.dat) +
  geom_point(aes(x=LY, y=MK, color=Type)) +
  xlab("LY") +
  ylab("MK") +
  theme_light()

sp

3.4.4 Boxplots and violin plots

Boxplots and violin plots of Avg.Fold by substrate type

require(gridExtra)
## Loading required package: gridExtra
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:randomForest':
## 
##     combine
p1 <- ggplot(cat.dat) +
  geom_boxplot(aes(x=Type, y=Avg.Fold, color=Type)) +
  xlab("Type") +
  ylab("Avg.Fold") +
  ggtitle("Box Plot") + 
  theme_light()

p2 <- ggplot(cat.dat) +
  geom_violin(aes(x=Type, y=Avg.Fold, color=Type), scale = "area") +
  xlab("Type") +
  ylab("Avg.Fold") +
  ggtitle("Violin Plot") + 
  theme_light()

grid.arrange(p1, p2, ncol=1)

Boxplot and violin plot of AUC by substrate type

p1 <- ggplot(cat.dat) +
  geom_boxplot(aes(x=Type, y=AUC, color=Type)) +
  xlab("Type") +
  ylab("AUC") +
  ggtitle("Box Plot") + 
  theme_light()

p2 <- ggplot(cat.dat) +
  geom_violin(aes(x=Type, y=AUC, color=Type), scale = "area") +
  xlab("Type") +
  ylab("AUC") +
  ggtitle("Violin Plot") + 
  theme_light()

grid.arrange(p1, p2, ncol=1)

Akt and mTOR exhibit a marked difference in AUC, another statistic derived from the shape of their response curves.
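The AUC column itself is precomputed in AUC_Ins.txt; purely to illustrate what such a statistic captures, a trapezoidal-rule sketch in base R (`trap.auc` and the toy values below are hypothetical):

```r
# Trapezoidal area under a response curve sampled at (possibly irregular) times
trap.auc <- function(t, y) {
  sum(diff(t) * (head(y, -1) + tail(y, -1)) / 2)
}

# Toy time course loosely shaped like an Akt response (minutes, log2 fold change)
t <- c(0.25, 0.5, 1, 2, 5, 10, 20, 60)
y <- c(0.0, 0.9, 1.3, 1.5, 1.4, 1.4, 1.3, 1.4)
trap.auc(t, y)
```

A sustained response accumulates more area than a transient one of the same peak height, which is why AUC separates the two kinase profiles well.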

3.4.5 Kernel density plots by substrate

Plot the kernel density of AUC by substrate.

p1 <- ggplot(cat.dat) +
  geom_density(aes(x=AUC, color=Type), n=1000, bw=0.025, trim=F) +
  scale_x_continuous(name="AUC", breaks=c(0,0.25,0.5,0.75,1.0,1.25)) +
  ylab("Density") +
  ggtitle("Kernel Density of AUC") + 
  theme_light()

p1

Plot the kernel density of Avg.fold by substrate.

p1 <- ggplot(cat.dat) +
  geom_density(aes(x=Avg.Fold, color=Type), n=1000, trim=F) +
  xlab("Avg.Fold") +
  ylab("Density") +
  ggtitle("Kernel Density of Avg.Fold") + 
  theme_light()

p1

4. Classification Models

Four classification models were built:
1. Random Forest.
2. Support Vector Machine (SVM).
3. K-Nearest Neighbours (KNN).
4. Logistic Regression.

The purpose of building four models was to examine which performed best. Two of the models are linear classifiers (SVM and Logistic Regression) and two are non-linear (Random Forest and KNN). Feature creation and feature selection were also trialled with the different models.

4.1 Random Forest model

4.1.1 Baseline Random Forest model

Create a baseline Random Forest model for prediction of Akt substrates and mTOR substrates

# Create a subset dataframe that contains samples with known Akt or mTOR (i.e. exclude all unlabeled samples).
# Remove identifier, seq. window and mTOR substrate flag from subset.
known.dat <- dat[dat$aktSubstrate == 1 | dat$mTORSubstrate == 1, c(-1,-2,-18)]
# This subset dataframe is reasonably balanced in terms of Akt (22) and mTOR (26) samples, so there is no need for up- or down-sampling for the baseline.

# Determine the maximum number of predictors that can be used in the random forest
max.predictors <- length(colnames(known.dat)) - 1 # Subtracting one as the last column is the response variable

set.seed(1) # Set seed to randomise the rows included in each fold
folds <- createFolds(known.dat[,"aktSubstrate"], 10) # Create 10 folds for cross validation

# Create a matrix to hold the results from the baseline random forest.
# Trial from 1 to 100 trees (number of rows = 100)
# Trial from 1 to max.predictors randomly sampled candidate variables for each split
gridSearchResults <- matrix(nrow=100, ncol=max.predictors)


for (m in 1:100){
  for(k in 1:max.predictors){
    
    fold.accuracy <- NULL
    
    for (tst in folds){
      
      known.dat.train = known.dat[-tst,]
      known.dat.test = known.dat[tst,]
      
      rf.Baseline <- randomForest(aktSubstrate~., data=known.dat.train,  mtry=k, replace=TRUE, importance=TRUE, ntree=m)
    
      ypred <-  predict(rf.Baseline, newdata=known.dat.test)
      ypred.accuracy <- sum(ypred == known.dat.test$aktSubstrate)/nrow(known.dat.test)
      
      if (is.null(fold.accuracy)){
        fold.accuracy <-  ypred.accuracy
        
      }
      else {
        fold.accuracy <- fold.accuracy + ypred.accuracy
        
      }
      
      #print(ypred.accuracy)
    }
    avg.accuracy <- fold.accuracy / length(folds)
    gridSearchResults[m, k] <-  avg.accuracy
    
  }
}

4.1.2 Summarise and visualise the random forest results.

The baseline random forest model separates Akt substrates from mTOR substrates with 10-fold cross-validated accuracy ranging from 81.5% (min) to 100% (max) across the grid, using the existing 14 features.

summary(gridSearchResults) # Summarise the model results
##        V1               V2               V3               V4        
##  Min.   :0.8150   Min.   :0.8817   Min.   :0.8717   Min.   :0.8600  
##  1st Qu.:0.9350   1st Qu.:0.9383   1st Qu.:0.9500   1st Qu.:0.9300  
##  Median :0.9550   Median :0.9550   Median :0.9550   Median :0.9550  
##  Mean   :0.9408   Mean   :0.9519   Mean   :0.9550   Mean   :0.9476  
##  3rd Qu.:0.9550   3rd Qu.:0.9550   3rd Qu.:0.9750   3rd Qu.:0.9550  
##  Max.   :0.9750   Max.   :1.0000   Max.   :0.9833   Max.   :0.9750  
##        V5               V6               V7               V8        
##  Min.   :0.8767   Min.   :0.8900   Min.   :0.8900   Min.   :0.8817  
##  1st Qu.:0.9300   1st Qu.:0.9300   1st Qu.:0.9133   1st Qu.:0.9100  
##  Median :0.9300   Median :0.9300   Median :0.9300   Median :0.9300  
##  Mean   :0.9339   Mean   :0.9325   Mean   :0.9270   Mean   :0.9233  
##  3rd Qu.:0.9500   3rd Qu.:0.9500   3rd Qu.:0.9300   3rd Qu.:0.9300  
##  Max.   :0.9750   Max.   :0.9750   Max.   :0.9750   Max.   :0.9550  
##        V9              V10              V11              V12        
##  Min.   :0.8767   Min.   :0.8900   Min.   :0.8733   Min.   :0.8900  
##  1st Qu.:0.9100   1st Qu.:0.9100   1st Qu.:0.9300   1st Qu.:0.9100  
##  Median :0.9300   Median :0.9300   Median :0.9300   Median :0.9300  
##  Mean   :0.9228   Mean   :0.9252   Mean   :0.9265   Mean   :0.9248  
##  3rd Qu.:0.9300   3rd Qu.:0.9300   3rd Qu.:0.9300   3rd Qu.:0.9300  
##  Max.   :0.9600   Max.   :0.9550   Max.   :0.9750   Max.   :0.9750  
##       V13              V14        
##  Min.   :0.8900   Min.   :0.8767  
##  1st Qu.:0.9100   1st Qu.:0.9300  
##  Median :0.9300   Median :0.9300  
##  Mean   :0.9236   Mean   :0.9258  
##  3rd Qu.:0.9300   3rd Qu.:0.9300  
##  Max.   :0.9300   Max.   :0.9500
require(plotly)
p <- plot_ly(z = ~gridSearchResults) %>% add_surface()
p

4.1.3 Feature creation for Random Forest Model.

Sequence Window

Create new features for the Random Forest model to see if performance can be improved. The sequence window column contains the sequence of the phosphorylation site: a 13-amino-acid sequence in which the 7th position corresponds to the amino acid that is phosphorylated. Break the sequence window out into individual features.

require(stringr)

cat.dat$Seq.Window <- as.character(cat.dat$Seq.Window)

str_length(cat.dat[1,]$Seq.Window)
## [1] 13
# Create a new column for each of the 13 amino acid positions.
for (i in 1:13){
  cat.dat[[paste0("s", i)]] <- as.factor(str_sub(cat.dat$Seq.Window, i, i))
}

Visualise each seq character in the seq window.

Generate plots for each seq character in the seq window position from position 1 to 13

for (seqName in colnames(cat.dat)[12:24]){
  myplot <- ggplot(cat.dat, aes_string(x=seqName, group = "Type")) +
            geom_bar(aes(y = ..prop.., fill = Type), stat="count") +
            scale_y_continuous(labels=scales::percent) +
            ylab("relative frequencies") +
            ggtitle(seqName)+
            facet_grid(~Type)


  # myplot

  plotFilename <- paste("./Plots/seq_",seqName,".png",sep="_")
  ggsave(plotFilename, myplot, dpi=320, device="png")

  print(paste("saved plot to: ",plotFilename," ...",sep=""))

}
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s1_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s2_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s3_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s4_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s5_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s6_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s7_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s8_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s9_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s10_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s11_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s12_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s13_.png ..."

Some sequence character positions show patterns in their amino acid profile (e.g. 45% of s1 for Akt is F).

Rate of Change Feature

The visual exploration of the response data showed that Akt and mTOR have distinct profiles in the first four time segments. Features have been created to take advantage of this distinguishing attribute, and hopefully improve model performance.

dat.extra <- dat %>% mutate(diffP1 = X30s - X15s,
                            diffP2 = X1m - X30s,
                            diffP3 = X2m - X1m,
                            diffP4 = X5m - X2m,
                            diffP5 = X10m - X5m,
                            diffP6 = X20m - X10m,
                            diffP7 = X60m - X20m,
                            Type = as.factor(case_when(aktSubstrate==1~"Akt", mTORSubstrate == 1~"mTOR", TRUE~"Other")),
                            s1 = as.factor(str_sub( Seq.Window, 1,1)),
                            s2 = as.factor(str_sub( Seq.Window, 2,2)),
                            s3 = as.factor(str_sub( Seq.Window, 3,3)),
                            s4 = as.factor(str_sub( Seq.Window, 4,4)),
                            s5 = as.factor(str_sub( Seq.Window, 5,5)),
                            s6 = as.factor(str_sub( Seq.Window, 6,6)),
                            s7 = as.factor(str_sub( Seq.Window, 7,7)),
                            s8 = as.factor(str_sub( Seq.Window, 8,8)),
                            s9 = as.factor(str_sub( Seq.Window, 9,9)),
                            s10 = as.factor(str_sub( Seq.Window, 10,10)),
                            s11 = as.factor(str_sub( Seq.Window, 11,11)),
                            s12 = as.factor(str_sub( Seq.Window, 12,12)),
                            s13 = as.factor(str_sub( Seq.Window, 13,13))
                            )

4.1.4 Create Random Forest model using new features

Test the influence of the new features on the accuracy of the random forest model by comparing to the baseline accuracy.

# Use an identical procedure to the baseline so the influence of the new features can be compared to the baseline
dat.extra.rf <- dat.extra %>% filter(Type != "Other") %>% select(-Identifier, -Seq.Window, -mTORSubstrate, -Type)
folds <- createFolds(dat.extra.rf$aktSubstrate, 10)

Grid search to find best parameters

max.predictors.dat.extra <- length(colnames(dat.extra.rf))-1

gridSearchResults.extra <- matrix(nrow=100, ncol=max.predictors.dat.extra)


for (m in 1:100){
  for(k in 1:max.predictors.dat.extra){

  fold.accuracy <- NULL
  
  for (tst in folds){
    
    dat.extra.rf.train = dat.extra.rf[-tst,]
    dat.extra.rf.test = dat.extra.rf[tst,]
    
    rf.ExtraFeatures <- randomForest(aktSubstrate~., data=dat.extra.rf.train,  mtry=k, replace=TRUE, importance=TRUE, ntree=m)
  
    ypred.prob <-  predict(rf.ExtraFeatures, newdata=dat.extra.rf.test, type="prob")
    ypred <- ifelse(ypred.prob[,2] > 0.5, 1, 0)
    ypred.accuracy <- sum(ypred == dat.extra.rf.test$aktSubstrate)/nrow(dat.extra.rf.test)
    
    if (is.null(fold.accuracy)){
      fold.accuracy <-  ypred.accuracy
      
    }
    else {
      fold.accuracy <- fold.accuracy + ypred.accuracy
      
    }
  
    }
    avg.accuracy <- fold.accuracy / length(folds)
    gridSearchResults.extra[m, k] <- avg.accuracy
    
  }
}

# View(gridSearchResults.extra)
write.csv(gridSearchResults.extra, './Output/gridSearchResults.csv')
# 5 trees and 4 predictors are all that's needed
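The best (ntree, mtry) pair can be read off the accuracy grid with `which(..., arr.ind = TRUE)`; a sketch on a toy matrix (the real gridSearchResults.extra has 100 rows of tree counts and one column per mtry value):

```r
# Toy accuracy grid: rows = ntree setting, cols = mtry setting
grid <- matrix(c(0.90, 0.95,
                 0.97, 0.93), nrow = 2, byrow = TRUE)

# All cells achieving the maximum accuracy, as (row, col) index pairs
best <- which(grid == max(grid), arr.ind = TRUE)
best
```

When several cells tie at the maximum, preferring the smallest row (fewest trees) keeps the model cheap.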

Visualise grid search results of revised random forest model

require(plotly)

p.extra <- plot_ly(z = ~gridSearchResults.extra) %>% add_surface()
p.extra
rf.ExtraFeatures$importance
##                   0           1 MeanDecreaseAccuracy MeanDecreaseGini
## Avg.Fold 0.00000000 0.000000000          0.000000000        0.0000000
## AUC      0.00500000 0.004761905          0.004862745        0.4065116
## X15s     0.00000000 0.000000000          0.000000000        0.0000000
## X30s     0.02229545 0.028321429          0.024428105        1.4381395
## X1m      0.09437671 0.133174603          0.105734554        5.2558140
## X2m      0.00000000 0.000000000          0.000000000        0.0000000
## X5m      0.00000000 0.000000000          0.000000000        0.0000000
## X10m     0.00000000 0.000000000          0.000000000        0.0000000
## X20m     0.00000000 0.000000000          0.000000000        0.0000000
## X60m     0.00000000 0.000000000          0.000000000        0.0000000
## Ins.1    0.00000000 0.000000000          0.000000000        0.0000000
## LY       0.00000000 0.000000000          0.000000000        0.0000000
## Ins.2    0.00000000 0.000000000          0.000000000        0.0000000
## MK       0.00000000 0.000000000          0.000000000        0.0000000
## diffP1   0.00000000 0.000000000          0.000000000        0.0000000
## diffP2   0.00000000 0.000000000          0.000000000        0.0000000
## diffP3   0.00000000 0.000000000          0.000000000        0.0000000
## diffP4   0.00250000 0.004285714          0.003157895        0.2139535
## diffP5   0.00000000 0.000000000          0.000000000        0.0000000
## diffP6   0.00000000 0.000000000          0.000000000        0.0000000
## diffP7   0.00000000 0.000000000          0.000000000        0.0000000
## s1       0.00000000 0.000000000          0.000000000        0.0000000
## s2       0.11496898 0.125702381          0.116270803        7.3841860
## s3       0.00000000 0.000000000          0.000000000        0.0000000
## s4       0.08455808 0.092642857          0.083578302        6.3293023
## s5       0.00000000 0.000000000          0.000000000        0.0000000
## s6       0.00000000 0.000000000          0.000000000        0.0000000
## s7       0.00000000 0.000000000          0.000000000        0.0000000
## s8       0.00000000 0.000000000          0.000000000        0.0000000
## s9       0.00000000 0.000000000          0.000000000        0.0000000
## s10      0.00000000 0.000000000          0.000000000        0.0000000
## s11      0.00000000 0.000000000          0.000000000        0.0000000
## s12      0.00000000 0.000000000          0.000000000        0.0000000
## s13      0.00000000 0.000000000          0.000000000        0.0000000
max(gridSearchResults.extra)
## [1] 1

4.1.5 Summary of Random Forest.

Below is a summary of the Random Forest model and proposed next steps:
a) The average out-of-bag (OOB) estimate of the error rate decreases from 9.09% to 6.98% with the new features added.
b) Some mtry and number-of-trees configurations achieve 100% accuracy.
c) Apply an adaptive sampling procedure to the dataset in order to classify the unlabeled data.
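Because the unlabeled pool (~12,000 sites) dwarfs the 48 labelled substrates, the adaptive sampling step will need some form of class rebalancing. One simple down-sampling option, sketched in base R on toy data (all names here are illustrative, not the final procedure):

```r
set.seed(1)
# Toy imbalanced data: 90 negatives, 10 positives
toy <- data.frame(x = rnorm(100), y = rep(c(0, 1), c(90, 10)))

# Down-sample every class to the size of the smallest class
minority.n <- min(table(toy$y))
idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$y),
                     function(i) sample(i, minority.n)))
balanced <- toy[idx, ]
table(balanced$y)
```

Repeating the draw across many resampled negative sets, rather than using a single draw, would make better use of the large unlabeled pool.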

4.2 SVM model

For the SVM model, no feature selection or creation will be used.

SVM model - The aktSubstrate as class

Prepare data for SVM model

names(dat)[c(1,2,18)] 
## [1] "Identifier"    "Seq.Window"    "mTORSubstrate"
# create an index to remember which rows are positively labelled as akt
aktIdx = which(dat$aktSubstrate==1)  
# create an index to remember which rows are positively labelled as mTOR
mTORIdx = which(dat$mTORSubstrate==1) 

aktSubstrate.dat <- dat[c(-1,-2,-18)]
# keep a copy for comparison later since we overwrite this feature in the line below
aktSubstrate.dat$aktSubstrate_Original = aktSubstrate.dat$aktSubstrate 
# note: as.numeric(as.factor(x)) does not map "0" to 0 and "1" to 1; it returns the factor level indices 1 and 2
aktSubstrate.dat$aktSubstrate = as.numeric(as.factor(aktSubstrate.dat$aktSubstrate)) 

aktSubstrate.dat$aktSubstrate_corrected = as.numeric(as.factor(aktSubstrate.dat$aktSubstrate)) - 1

dfCompare <- data.frame(aktSubstrate.dat$aktSubstrate_Original, aktSubstrate.dat$aktSubstrate, aktSubstrate.dat$aktSubstrate_corrected)

names(dfCompare) <- c("Original", "Factor to Numeric Conversion", "Corrected")

# Take a look at the original field, the naive numeric conversion, and the corrected numeric conversion.
head(dfCompare)
##   Original Factor to Numeric Conversion Corrected
## 1        0                            1         0
## 2        0                            1         0
## 3        0                            1         0
## 4        0                            1         0
## 5        0                            1         0
## 6        0                            1         0
dim(aktSubstrate.dat)
## [1] 12062    17
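The off-by-one behind the aktSubstrate_corrected column can be reproduced in isolation; a minimal base-R sketch:

```r
# Factor levels are stored as integer codes starting at 1, so coercing the
# labels "0"/"1" with as.numeric() yields 1/2, not 0/1
x <- factor(c("0", "1", "0"))
as.numeric(x)                  # level codes: 1 2 1
as.numeric(x) - 1              # corrected: 0 1 0
as.numeric(as.character(x))    # safest for numeric-labelled factors: 0 1 0
```

The `- 1` correction used above relies on the levels being exactly "0" and "1"; the as.character route works regardless of which numeric labels are present.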

Build SVM model and calculate performance metrics

aktSubstrate_data.mat <- apply(X = aktSubstrate.dat[,-(15:17)], MARGIN = 2, FUN =  as.numeric) 

aktSubstrate_data.mat <- aktSubstrate_data.mat[c(aktIdx,mTORIdx),] # Remove the unlabeled samples; otherwise the class imbalance is too large for the SVM to train well.

aktSubstrate_known.dat<- aktSubstrate_data.mat

attr(aktSubstrate_data.mat, "dimnames")[2]
## [[1]]
##  [1] "Avg.Fold" "AUC"      "X15s"     "X30s"     "X1m"      "X2m"     
##  [7] "X5m"      "X10m"     "X20m"     "X60m"     "Ins.1"    "LY"      
## [13] "Ins.2"    "MK"
data.cls.truth <- sapply(X = aktSubstrate.dat$aktSubstrate_corrected, FUN = function(x) {ifelse(x=="1", 1, 0)}) # try with the corrected truth class instead

data.cls.truth_akt <- data.cls.truth[c(aktIdx,mTORIdx)] # Need to match the indexed rows

rownames(aktSubstrate_data.mat) <- paste("p", 1:nrow(aktSubstrate_data.mat), sep="_")

#data.cls.truth


library(e1071) # provides svm()

k <- 10
set.seed(1)
fold <- createFolds(data.cls.truth_akt, 10)
# gold standard (original data)
TP <- TN <- FP <- FN <- c()
for(i in 1:length(fold)){

    model <- svm(aktSubstrate_data.mat[-fold[[i]],], data.cls.truth_akt[-fold[[i]]])
    preds <- ifelse(predict(model, aktSubstrate_data.mat[fold[[i]],]) > 0.5, 1, 0)
    TP <- c(TP, sum((data.cls.truth_akt[fold[[i]]] == preds)[data.cls.truth_akt[fold[[i]]] == "1"]))
    TN <- c(TN, sum((data.cls.truth_akt[fold[[i]]] == preds)[data.cls.truth_akt[fold[[i]]] == "0"]))
    FP <- c(FP, sum((data.cls.truth_akt[fold[[i]]] != preds)[preds == "1"]))
    FN <- c(FN, sum((data.cls.truth_akt[fold[[i]]] != preds)[preds == "0"]))
    
}
mean(F1(cbind(TN, FP, TP, FN)))
## [1] 0.96
mean(Sen(cbind(TN, FP, TP, FN)))
## [1] 0.9333333
mean(Spe(cbind(TN, FP, TP, FN)))
## [1] 1
mean(Acc(cbind(TN, FP, TP, FN)))
## [1] 0.96
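The helpers F1, Sen, Spe and Acc are assumed to be defined earlier in the report; if not, versions consistent with the cbind(TN, FP, TP, FN) column order used above can be sketched as:

```r
# Confusion-matrix metrics, operating row-wise on a matrix with
# named columns TN, FP, TP, FN (one row per cross-validation fold)
Sen <- function(m) m[, "TP"] / (m[, "TP"] + m[, "FN"])   # sensitivity / recall
Spe <- function(m) m[, "TN"] / (m[, "TN"] + m[, "FP"])   # specificity
Acc <- function(m) (m[, "TP"] + m[, "TN"]) / rowSums(m)  # overall accuracy
F1  <- function(m) 2 * m[, "TP"] / (2 * m[, "TP"] + m[, "FP"] + m[, "FN"])
# Note: a fold with no positive examples (TP + FN == 0) yields NaN for Sen and F1
```

Averaging these per-fold vectors with mean(), as done above, gives the macro-averaged cross-validation estimate.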

SVM model - The mTORSubstrate as class

Prepare data for SVM model

mTORSubstrate.dat <- dat[c(-1,-2,-17)]
# keep a copy for comparison later since we overwrite this feature in the line below
mTORSubstrate.dat$mTORSubstrate_Original = mTORSubstrate.dat$mTORSubstrate 
# note: as.numeric(as.factor(x)) does not map "0" to 0 and "1" to 1; it returns the factor level indices 1 and 2
mTORSubstrate.dat$mTORSubstrate = as.numeric(as.factor(mTORSubstrate.dat$mTORSubstrate)) 

mTORSubstrate.dat$mTORSubstrate_corrected = as.numeric(as.factor(mTORSubstrate.dat$mTORSubstrate)) - 1

dfCompare <- data.frame(mTORSubstrate.dat$mTORSubstrate_Original, mTORSubstrate.dat$mTORSubstrate, mTORSubstrate.dat$mTORSubstrate_corrected)

names(dfCompare) <- c("Original", "Factor to Numeric Conversion", "Corrected")

# let's take a look at the original field, the numeric conversion, and the corrected numerical conversion.
# Note: Original retains the factor labels; the two conversion columns are numeric
head(dfCompare)
##   Original Factor to Numeric Conversion Corrected
## 1        0                            1         0
## 2        0                            1         0
## 3        0                            1         0
## 4        0                            1         0
## 5        0                            1         0
## 6        0                            1         0
dim(mTORSubstrate.dat)
## [1] 12062    17
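The need for the corrected column can be reproduced on a toy vector (a hypothetical example, not part of the dataset): `as.factor` followed by `as.numeric` yields 1-based level codes rather than the original 0/1 labels.

```r
# as.numeric(as.factor(x)) returns factor level codes, which are 1-based:
# "0" maps to level 1 and "1" maps to level 2.
x <- c("0", "0", "1")
codes <- as.numeric(as.factor(x))          # 1 1 2, not 0 0 1
corrected <- as.numeric(as.factor(x)) - 1  # subtract 1 to recover 0 0 1
print(codes)
print(corrected)
```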

Build SVM model and calculate performance metrics

mTORSubstrate_data.mat <- apply(X = mTORSubstrate.dat[,-(15:17)], MARGIN = 2, FUN = as.numeric) # drop columns 15-17: the class label plus the two comparison columns added for dfCompare above

mTORSubstrate_data.mat <- mTORSubstrate_data.mat[c(aktIdx,mTORIdx),] # keep only the labelled rows; otherwise the class imbalance is too severe for the SVM to train on

mTORSubstrate_known.dat <- mTORSubstrate_data.mat

attr(mTORSubstrate_data.mat, "dimnames")[2]
## [[1]]
##  [1] "Avg.Fold" "AUC"      "X15s"     "X30s"     "X1m"      "X2m"     
##  [7] "X5m"      "X10m"     "X20m"     "X60m"     "Ins.1"    "LY"      
## [13] "Ins.2"    "MK"
data.cls.truth_mTOR <- sapply(X = mTORSubstrate.dat$mTORSubstrate_corrected, FUN = function(x) {ifelse(x=="1", 1, 0)}) # use the corrected truth class

data.cls.truth_mTOR <- data.cls.truth_mTOR[c(aktIdx,mTORIdx)] # Need to match the indexed rows

rownames(mTORSubstrate_data.mat) <- paste("p", 1:nrow(mTORSubstrate_data.mat), sep="_")

#data.cls.truth

k <- 10
set.seed(1)
fold <- createFolds(data.cls.truth_mTOR, k)
# gold standard (original data)
TP <- TN <- FP <- FN <- c()
for(i in 1:length(fold)){
    model <- svm(mTORSubstrate_data.mat[-fold[[i]],], data.cls.truth_mTOR[-fold[[i]]])
    preds <- ifelse(predict(model, mTORSubstrate_data.mat[fold[[i]],]) > 0.5, 1, 0)
    TP <- c(TP, sum((data.cls.truth_mTOR[fold[[i]]] == preds)[data.cls.truth_mTOR[fold[[i]]] == "1"]))
    TN <- c(TN, sum((data.cls.truth_mTOR[fold[[i]]] == preds)[data.cls.truth_mTOR[fold[[i]]] == "0"]))
    FP <- c(FP, sum((data.cls.truth_mTOR[fold[[i]]] != preds)[preds == "1"]))
    FN <- c(FN, sum((data.cls.truth_mTOR[fold[[i]]] != preds)[preds == "0"]))
}
mean(F1(cbind(TN, FP, TP, FN)))
## [1] 0.96
mean(Sen(cbind(TN, FP, TP, FN)))
## [1] 1
mean(Spe(cbind(TN, FP, TP, FN)))
## [1] 0.9333333
mean(Acc(cbind(TN, FP, TP, FN)))
## [1] 0.96

4.3 KNN Model

A KNN model was built with a wrapper method for feature selection, to test whether feature selection could remove noise from the data and improve model performance.

KNN Model for aktSubstrate as class:

selectFeature <- function(train, test, cls.train, cls.test, features) {
  ## identify a feature to be selected
  current.best.accuracy <- -Inf
  selected.i <- NULL
  for(i in 1:ncol(train)) {
    current.f <- colnames(train)[i]
    if(!current.f %in% features) {
      model <- knn(train=train[,c(features, current.f)], test=test[,c(features, current.f)], cl=cls.train, k=9)
      test.acc <- sum(model == cls.test) / length(cls.test)
      
      if(test.acc > current.best.accuracy) {
        current.best.accuracy <- test.acc
        selected.i <- colnames(train)[i]
      }
    }
  }
  return(selected.i)
}



set.seed(1)
inTrain <- createDataPartition(data.cls.truth_akt, p = .6)[[1]]
allFeatures <- colnames(aktSubstrate_data.mat)
train <- aktSubstrate_data.mat[ inTrain,]
test  <- aktSubstrate_data.mat[-inTrain,]
cls.train <- data.cls.truth_akt[inTrain]
cls.test <- data.cls.truth_akt[-inTrain]

# use correlation to determine the first feature
# (note: this reconstruction assumes cls.train is grouped by class; since
# cls.train is already numeric 0/1, it could equally be used directly)
cls.train.numeric <- rep(c(0, 1), c(sum(cls.train == "0"), sum(cls.train == "1")))
features <- c()
current.best.cor <- 0
for(i in 1:ncol(train)) {
  if(current.best.cor < abs(cor(train[,i], cls.train.numeric))) {
    current.best.cor <- abs(cor(train[,i], cls.train.numeric))
    features <- colnames(train)[i]
  }
}
print(features)
## [1] "LY"
for (j in 2:10) {
  selected.i <- selectFeature(train, test, cls.train, cls.test, features)
  print(selected.i)

  # add the best feature from current run
  features <- c(features, selected.i)
}
## [1] "X1m"
## [1] "AUC"
## [1] "X15s"
## [1] "Avg.Fold"
## [1] "X30s"
## [1] "X2m"
## [1] "X10m"
## [1] "X60m"
## [1] "Ins.1"
features_selection_aktSubstrate_data.mat <- aktSubstrate_data.mat[,features]

k <- 10
set.seed(1)
fold <- createFolds(data.cls.truth_akt, k)
# gold standard (original data)
TP <- TN <- FP <- FN <- c()
for(i in 1:length(fold)){
  
    #model <- svm(features_selection_aktSubstrate_data.mat[-fold[[i]],], data.cls.truth_akt[-fold[[i]]])
  preds <- knn(train=features_selection_aktSubstrate_data.mat[-fold[[i]],], test=features_selection_aktSubstrate_data.mat[fold[[i]],], cl=data.cls.truth_akt[-fold[[i]]], k=9)
  
    #preds <- ifelse(predict(model, features_selection_aktSubstrate_data.mat[fold[[i]],]) > 0.5, 1, 0)
    TP <- c(TP, sum((data.cls.truth_akt[fold[[i]]] == preds)[data.cls.truth_akt[fold[[i]]] == "1"]))
    TN <- c(TN, sum((data.cls.truth_akt[fold[[i]]] == preds)[data.cls.truth_akt[fold[[i]]] == "0"]))
    FP <- c(FP, sum((data.cls.truth_akt[fold[[i]]] != preds)[preds == "1"]))
    FN <- c(FN, sum((data.cls.truth_akt[fold[[i]]] != preds)[preds == "0"]))
    
}
mean(F1(cbind(TN, FP, TP, FN)))
## [1] 0.96
mean(Sen(cbind(TN, FP, TP, FN)))
## [1] 0.9333333
mean(Spe(cbind(TN, FP, TP, FN)))
## [1] 1
mean(Acc(cbind(TN, FP, TP, FN)))
## [1] 0.96

KNN Model for mTORSubstrate as class:

set.seed(1)
inTrain <- createDataPartition(data.cls.truth_mTOR, p = .6)[[1]]
allFeatures <- colnames(mTORSubstrate_data.mat)
train <- mTORSubstrate_data.mat[ inTrain,]
test  <- mTORSubstrate_data.mat[-inTrain,]
cls.train <- data.cls.truth_mTOR[inTrain]
cls.test <- data.cls.truth_mTOR[-inTrain]

# use correlation to determine the first feature
# (note: this reconstruction assumes cls.train is grouped by class; since
# cls.train is already numeric 0/1, it could equally be used directly)
cls.train.numeric <- rep(c(0, 1), c(sum(cls.train == "0"), sum(cls.train == "1")))
features <- c()
current.best.cor <- 0
for(i in 1:ncol(train)) {
  if(current.best.cor < abs(cor(train[,i], cls.train.numeric))) {
    current.best.cor <- abs(cor(train[,i], cls.train.numeric))
    features <- colnames(train)[i]
  }
}
print(features)
## [1] "X30s"
for (j in 2:10) {
  selected.i <- selectFeature(train, test, cls.train, cls.test, features)
  print(selected.i)

  # add the best feature from current run
  features <- c(features, selected.i)
}
## [1] "Avg.Fold"
## [1] "AUC"
## [1] "X15s"
## [1] "X1m"
## [1] "LY"
## [1] "X2m"
## [1] "X10m"
## [1] "X60m"
## [1] "Ins.1"
features_selection_mTORSubstrate_data.mat <- mTORSubstrate_data.mat[,features]

k <- 10
set.seed(1)
fold <- createFolds(data.cls.truth_mTOR, k)
# gold standard (original data)
TP <- TN <- FP <- FN <- c()
for(i in 1:length(fold)){
  
    #model <- svm(features_selection_mTORSubstrate_data.mat[-fold[[i]],], data.cls.truth_mTOR[-fold[[i]]])
    preds <- knn(train=features_selection_mTORSubstrate_data.mat[-fold[[i]],], test=features_selection_mTORSubstrate_data.mat[fold[[i]],], cl=data.cls.truth_mTOR[-fold[[i]]], k=9)
      
    #preds <- ifelse(predict(model, features_selection_mTORSubstrate_data.mat[fold[[i]],]) > 0.5, 1, 0)
    TP <- c(TP, sum((data.cls.truth_mTOR[fold[[i]]] == preds)[data.cls.truth_mTOR[fold[[i]]] == "1"]))
    TN <- c(TN, sum((data.cls.truth_mTOR[fold[[i]]] == preds)[data.cls.truth_mTOR[fold[[i]]] == "0"]))
    FP <- c(FP, sum((data.cls.truth_mTOR[fold[[i]]] != preds)[preds == "1"]))
    FN <- c(FN, sum((data.cls.truth_mTOR[fold[[i]]] != preds)[preds == "0"]))
    
}
mean(F1(cbind(TN, FP, TP, FN)))
## [1] 0.96
mean(Sen(cbind(TN, FP, TP, FN)))
## [1] 1
mean(Spe(cbind(TN, FP, TP, FN)))
## [1] 0.9333333
mean(Acc(cbind(TN, FP, TP, FN)))
## [1] 0.96

Feature selection did not improve the accuracy of the KNN model on the known samples.

4.4 Logistic Regression Model

The logistic regression model was built using stepwise feature selection, to test whether feature selection could remove noise from the data and improve model performance.

4.4.1 Baseline Logistic Regression model for aktSubstrate class

glm.akt<-glm(dat$aktSubstrate~dat$Avg.Fold+dat$AUC+dat$X15s+dat$X30s+dat$X1m+dat$X2m+dat$X5m+dat$X10m+dat$X20m+dat$X60m+dat$Ins.1+dat$LY+dat$Ins.2+dat$MK,data=dat,family = binomial)
summary(glm.akt)
## 
## Call:
## glm(formula = dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + 
##     dat$X30s + dat$X1m + dat$X2m + dat$X5m + dat$X10m + dat$X20m + 
##     dat$X60m + dat$Ins.1 + dat$LY + dat$Ins.2 + dat$MK, family = binomial, 
##     data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4773  -0.0288  -0.0146  -0.0075   4.2066  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.707e+00  1.107e+00  -1.542    0.123    
## dat$Avg.Fold  1.274e+08  1.252e+08   1.018    0.309    
## dat$AUC      -1.497e+01  3.214e+00  -4.657 3.21e-06 ***
## dat$X15s     -1.593e+07  1.565e+07  -1.018    0.309    
## dat$X30s     -1.593e+07  1.565e+07  -1.018    0.309    
## dat$X1m      -1.593e+07  1.565e+07  -1.018    0.309    
## dat$X2m      -1.593e+07  1.565e+07  -1.018    0.309    
## dat$X5m      -1.593e+07  1.565e+07  -1.018    0.309    
## dat$X10m     -1.593e+07  1.565e+07  -1.018    0.309    
## dat$X20m     -1.593e+07  1.565e+07  -1.018    0.309    
## dat$X60m     -1.593e+07  1.565e+07  -1.018    0.309    
## dat$Ins.1    -2.659e-01  3.362e-01  -0.791    0.429    
## dat$LY       -4.172e-01  3.064e-01  -1.362    0.173    
## dat$Ins.2     3.409e-01  3.236e-01   1.054    0.292    
## dat$MK        2.495e-01  2.580e-01   0.967    0.334    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 321.46  on 12061  degrees of freedom
## Residual deviance: 175.06  on 12047  degrees of freedom
## AIC: 205.06
## 
## Number of Fisher Scoring iterations: 11

4.4.2 Stepwise feature selection for aktSubstrate class

# backward stepwise feature selection by AIC
glm.akt1<-step(glm.akt)
## Start:  AIC=205.06
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X30s + 
##     dat$X1m + dat$X2m + dat$X5m + dat$X10m + dat$X20m + dat$X60m + 
##     dat$Ins.1 + dat$LY + dat$Ins.2 + dat$MK
## 
##                Df Deviance    AIC
## - dat$X2m       1   175.40 203.40
## - dat$X60m      1   175.40 203.40
## - dat$X10m      1   175.40 203.40
## - dat$X30s      1   175.40 203.40
## - dat$Avg.Fold  1   175.40 203.40
## - dat$X5m       1   175.40 203.40
## - dat$X1m       1   175.40 203.40
## - dat$X15s      1   175.40 203.40
## - dat$X20m      1   175.40 203.40
## - dat$Ins.1     1   175.62 203.62
## - dat$MK        1   175.97 203.97
## - dat$Ins.2     1   176.21 204.21
## - dat$LY        1   176.80 204.80
## <none>              175.06 205.06
## - dat$AUC       1   204.03 232.03
## 
## Step:  AIC=203.4
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X30s + 
##     dat$X1m + dat$X5m + dat$X10m + dat$X20m + dat$X60m + dat$Ins.1 + 
##     dat$LY + dat$Ins.2 + dat$MK
## 
##                Df Deviance    AIC
## - dat$X60m      1   175.40 201.40
## - dat$X10m      1   175.52 201.52
## - dat$X30s      1   175.53 201.53
## - dat$Ins.1     1   175.94 201.94
## - dat$X5m       1   176.10 202.10
## - dat$MK        1   176.27 202.27
## - dat$X1m       1   176.37 202.37
## - dat$Ins.2     1   176.54 202.54
## - dat$LY        1   177.15 203.15
## <none>              175.40 203.40
## - dat$Avg.Fold  1   177.46 203.46
## - dat$X15s      1   179.55 205.55
## - dat$X20m      1   180.10 206.10
## - dat$AUC       1   204.24 230.24
## 
## Step:  AIC=201.4
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X30s + 
##     dat$X1m + dat$X5m + dat$X10m + dat$X20m + dat$Ins.1 + dat$LY + 
##     dat$Ins.2 + dat$MK
## 
##                Df Deviance    AIC
## - dat$X10m      1   175.53 199.53
## - dat$X30s      1   175.59 199.59
## - dat$Ins.1     1   175.95 199.95
## - dat$MK        1   176.29 200.29
## - dat$Ins.2     1   176.54 200.54
## - dat$X5m       1   176.65 200.65
## - dat$X1m       1   177.06 201.06
## - dat$LY        1   177.16 201.16
## <none>              175.40 201.40
## - dat$X20m      1   180.66 204.66
## - dat$Avg.Fold  1   180.87 204.87
## - dat$X15s      1   181.32 205.32
## - dat$AUC       1   204.90 228.90
## 
## Step:  AIC=199.53
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X30s + 
##     dat$X1m + dat$X5m + dat$X20m + dat$Ins.1 + dat$LY + dat$Ins.2 + 
##     dat$MK
## 
##                Df Deviance    AIC
## - dat$X30s      1   175.62 197.62
## - dat$Ins.1     1   176.07 198.07
## - dat$MK        1   176.35 198.35
## - dat$Ins.2     1   176.71 198.71
## - dat$X5m       1   176.78 198.78
## - dat$X1m       1   177.06 199.06
## - dat$LY        1   177.24 199.24
## <none>              175.53 199.53
## - dat$X20m      1   180.71 202.71
## - dat$Avg.Fold  1   181.69 203.69
## - dat$X15s      1   181.88 203.88
## - dat$AUC       1   205.00 227.00
## 
## Step:  AIC=197.62
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X1m + 
##     dat$X5m + dat$X20m + dat$Ins.1 + dat$LY + dat$Ins.2 + dat$MK
## 
##                Df Deviance    AIC
## - dat$Ins.1     1   176.19 196.19
## - dat$MK        1   176.59 196.59
## - dat$Ins.2     1   176.78 196.78
## - dat$X5m       1   176.80 196.80
## - dat$X1m       1   177.14 197.14
## - dat$LY        1   177.29 197.29
## <none>              175.62 197.62
## - dat$X20m      1   180.97 200.97
## - dat$X15s      1   181.97 201.97
## - dat$Avg.Fold  1   182.71 202.71
## - dat$AUC       1   207.84 227.84
## 
## Step:  AIC=196.19
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X1m + 
##     dat$X5m + dat$X20m + dat$LY + dat$Ins.2 + dat$MK
## 
##                Df Deviance    AIC
## - dat$Ins.2     1   176.92 194.92
## - dat$MK        1   176.93 194.93
## - dat$X5m       1   177.44 195.44
## - dat$X1m       1   177.56 195.56
## <none>              176.19 196.19
## - dat$LY        1   178.24 196.24
## - dat$X20m      1   181.45 199.45
## - dat$X15s      1   182.16 200.16
## - dat$Avg.Fold  1   182.81 200.81
## - dat$AUC       1   208.42 226.42
## 
## Step:  AIC=194.92
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X1m + 
##     dat$X5m + dat$X20m + dat$LY + dat$MK
## 
##                Df Deviance    AIC
## - dat$X5m       1   178.15 194.15
## - dat$X1m       1   178.35 194.35
## - dat$MK        1   178.43 194.43
## <none>              176.92 194.92
## - dat$LY        1   178.95 194.95
## - dat$X20m      1   181.99 197.99
## - dat$X15s      1   183.43 199.43
## - dat$Avg.Fold  1   184.19 200.19
## - dat$AUC       1   209.12 225.12
## 
## Step:  AIC=194.15
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X1m + 
##     dat$X20m + dat$LY + dat$MK
## 
##                Df Deviance    AIC
## - dat$X1m       1   178.83 192.83
## - dat$MK        1   179.75 193.75
## - dat$LY        1   179.94 193.94
## <none>              178.15 194.15
## - dat$X20m      1   182.08 196.08
## - dat$X15s      1   184.19 198.19
## - dat$Avg.Fold  1   185.86 199.86
## - dat$AUC       1   209.31 223.31
## 
## Step:  AIC=192.83
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X20m + 
##     dat$LY + dat$MK
## 
##                Df Deviance    AIC
## - dat$MK        1   180.47 192.47
## - dat$LY        1   180.51 192.51
## <none>              178.83 192.83
## - dat$X20m      1   182.26 194.26
## - dat$X15s      1   184.43 196.43
## - dat$Avg.Fold  1   193.54 205.54
## - dat$AUC       1   209.46 221.46
## 
## Step:  AIC=192.47
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X20m + 
##     dat$LY
## 
##                Df Deviance    AIC
## - dat$LY        1   181.04 191.04
## <none>              180.47 192.47
## - dat$X20m      1   183.41 193.41
## - dat$X15s      1   186.17 196.17
## - dat$Avg.Fold  1   194.14 204.14
## - dat$AUC       1   209.98 219.98
## 
## Step:  AIC=191.04
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X20m
## 
##                Df Deviance    AIC
## <none>              181.04 191.04
## - dat$X20m      1   183.74 191.74
## - dat$X15s      1   187.84 195.84
## - dat$Avg.Fold  1   194.17 202.17
## - dat$AUC       1   210.32 218.32
summary(glm.akt1)
## 
## Call:
## glm(formula = dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + 
##     dat$X20m, family = binomial, data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5329  -0.0325  -0.0177  -0.0098   4.0712  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.1601     1.0160  -2.126  0.03350 *  
## dat$Avg.Fold   2.4666     0.5944   4.150 3.32e-05 ***
## dat$AUC      -13.2065     2.8361  -4.657 3.21e-06 ***
## dat$X15s      -0.7095     0.2590  -2.739  0.00616 ** 
## dat$X20m      -0.7859     0.4447  -1.768  0.07714 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 321.46  on 12061  degrees of freedom
## Residual deviance: 181.04  on 12057  degrees of freedom
## AIC: 191.04
## 
## Number of Fisher Scoring iterations: 11
# interpret the model
coef(glm.akt1)
##  (Intercept) dat$Avg.Fold      dat$AUC     dat$X15s     dat$X20m 
##   -2.1600802    2.4665739  -13.2065475   -0.7094784   -0.7859271
exp(glm.akt1$coefficients) # odds ratios: multiplicative change in the odds per unit increase in each predictor
##  (Intercept) dat$Avg.Fold      dat$AUC     dat$X15s     dat$X20m 
## 1.153159e-01 1.178201e+01 1.838524e-06 4.919007e-01 4.556970e-01
exp(confint(glm.akt1))#Confidence interval
## Waiting for profiling to be done...
##                     2.5 %       97.5 %
## (Intercept)  1.490772e-02 8.100763e-01
## dat$Avg.Fold 3.274079e+00 3.474837e+01
## dat$AUC      4.830945e-09 3.428494e-04
## dat$X15s     3.024994e-01 8.339941e-01
## dat$X20m     2.025290e-01 1.174675e+00
xp0.5<-0/glm.akt1$coefficients[]
xp0.5 # note: this is trivially 0 for every term; the value of a single predictor x_j (others held at 0) giving pi = 0.5 would be -(Intercept)/coef_j
##  (Intercept) dat$Avg.Fold      dat$AUC     dat$X15s     dat$X20m 
##            0            0            0            0            0
ratio05<-glm.akt1$coefficients[]*0.25
ratio05 # "divide-by-4" rule: coefficient/4 approximates the maximum change in probability per unit change in that predictor
##  (Intercept) dat$Avg.Fold      dat$AUC     dat$X15s     dat$X20m 
##   -0.5400201    0.6166435   -3.3016369   -0.1773696   -0.1964818
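The ratio05 quantity above is one quarter of each coefficient, which corresponds to the "divide-by-4" rule of thumb for logistic regression: the logistic curve is steepest at p = 0.5, where its slope with respect to a predictor equals that predictor's coefficient divided by 4. A quick numerical check, using the Avg.Fold coefficient reported above:

```r
# Numerically verify that the slope of the logistic curve at p = 0.5 equals
# the coefficient divided by 4 (the "divide-by-4" rule).
logistic <- function(z) 1 / (1 + exp(-z))
b1 <- 2.4665739          # Avg.Fold coefficient from the fitted model above
h  <- 1e-6
slope <- (logistic(b1 * h) - logistic(0)) / h  # numeric derivative at z = 0 (p = 0.5)
c(slope = slope, b1.over.4 = b1 / 4)           # both approximately 0.6166
```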

4.4.3 Evaluate Logistic Regression baseline model for aktSubstrate class

#evaluate the model
R2cox<-1-exp((glm.akt1$deviance-glm.akt1$null.deviance)/length(dat$aktSubstrate))
cat("Cox-Snell R2=",R2cox,"\n")
## Cox-Snell R2= 0.01157409
R2nag<-R2cox/(1-exp((-glm.akt1$null.deviance)/length(dat$aktSubstrate)))
R2nag
## [1] 0.440105
cat("Nagelkerke R2=",R2nag,"\n")
## Nagelkerke R2= 0.440105
plot(residuals(glm.akt1))

#install.packages('car')

influencePlot(glm.akt1)

##         StudRes          Hat      CookD
## 1226  2.3888727 0.0250214253 0.07145550
## 2101  3.4312407 0.0003102064 0.02114883
## 3764  4.1273076 0.0001158346 0.09203907
## 5797 -0.9838078 0.1136121244 0.01633371
## 9115 -0.6180272 0.3295702238 0.02389170

Summary of evaluation:

fitted.pi<-fitted(glm.akt1)
ypred<-1*(fitted.pi>0.5)
length(ypred)
## [1] 12062
n<-table(dat$aktSubstrate,ypred)
n
##    ypred
##         0     1
##   0 12037     3
##   1    22     0
cat("sensitivity=",n[2,2]/sum(n[2,]),"\n")
## sensitivity= 0
cat("specificity=",n[1,1]/sum(n[1,]),"\n")
## specificity= 0.9997508

With a 0.5 cutoff the baseline model identifies none of the 22 known Akt substrates (sensitivity = 0), a consequence of the severe class imbalance in the full dataset.

4.4.4 Baseline Logistic Regression model for mTORSubstrate class

#create the logistic regression taking all the features
glm.mtor<-glm(dat$mTORSubstrate~dat$Avg.Fold+dat$AUC+dat$X15s+dat$X30s+dat$X1m+dat$X2m+dat$X5m+dat$X10m+dat$X20m+dat$X60m+dat$Ins.1+dat$LY+dat$Ins.2+dat$MK,data=dat,family = binomial)
summary(glm.mtor)
## 
## Call:
## glm(formula = dat$mTORSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + 
##     dat$X30s + dat$X1m + dat$X2m + dat$X5m + dat$X10m + dat$X20m + 
##     dat$X60m + dat$Ins.1 + dat$LY + dat$Ins.2 + dat$MK, family = binomial, 
##     data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8940  -0.0515  -0.0380  -0.0304   3.7002  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -8.135e+00  1.265e+00  -6.431 1.27e-10 ***
## dat$Avg.Fold -5.643e+07  5.946e+08  -0.095    0.924    
## dat$AUC       1.448e+00  2.263e+00   0.640    0.522    
## dat$X15s      7.053e+06  7.432e+07   0.095    0.924    
## dat$X30s      7.053e+06  7.432e+07   0.095    0.924    
## dat$X1m       7.053e+06  7.432e+07   0.095    0.924    
## dat$X2m       7.053e+06  7.432e+07   0.095    0.924    
## dat$X5m       7.053e+06  7.432e+07   0.095    0.924    
## dat$X10m      7.053e+06  7.432e+07   0.095    0.924    
## dat$X20m      7.053e+06  7.432e+07   0.095    0.924    
## dat$X60m      7.053e+06  7.432e+07   0.095    0.924    
## dat$Ins.1     4.525e-01  2.928e-01   1.546    0.122    
## dat$LY       -1.042e+00  1.722e-01  -6.052 1.43e-09 ***
## dat$Ins.2     1.217e-01  3.205e-01   0.380    0.704    
## dat$MK        4.010e-01  2.664e-01   1.505    0.132    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 371.21  on 12061  degrees of freedom
## Residual deviance: 283.01  on 12047  degrees of freedom
## AIC: 313.01
## 
## Number of Fisher Scoring iterations: 10

4.4.5 Stepwise feature selection for mTORSubstrate class

# backward stepwise feature selection by AIC
glm.mtor1<-step(glm.mtor)
## Start:  AIC=313.01
## dat$mTORSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X30s + 
##     dat$X1m + dat$X2m + dat$X5m + dat$X10m + dat$X20m + dat$X60m + 
##     dat$Ins.1 + dat$LY + dat$Ins.2 + dat$MK
## 
##                Df Deviance    AIC
## - dat$X30s      1   283.02 311.02
## - dat$X10m      1   283.02 311.02
## - dat$X1m       1   283.02 311.02
## - dat$X2m       1   283.02 311.02
## - dat$X60m      1   283.02 311.02
## - dat$Avg.Fold  1   283.02 311.02
## - dat$X15s      1   283.02 311.02
## - dat$X5m       1   283.02 311.02
## - dat$X20m      1   283.02 311.02
## - dat$Ins.2     1   283.15 311.15
## - dat$AUC       1   283.42 311.42
## <none>              283.01 313.01
## - dat$MK        1   285.26 313.26
## - dat$Ins.1     1   285.59 313.59
## - dat$LY        1   315.08 343.08
## 
## Step:  AIC=311.02
## dat$mTORSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X1m + 
##     dat$X2m + dat$X5m + dat$X10m + dat$X20m + dat$X60m + dat$Ins.1 + 
##     dat$LY + dat$Ins.2 + dat$MK
## 
##                Df Deviance    AIC
## - dat$Ins.2     1   283.16 309.16
## - dat$AUC       1   283.43 309.43
## - dat$X1m       1   283.86 309.86
## - dat$X2m       1   284.11 310.11
## - dat$X10m      1   284.18 310.18
## - dat$X60m      1   284.74 310.74
## <none>              283.02 311.02
## - dat$X15s      1   285.10 311.10
## - dat$MK        1   285.27 311.27
## - dat$Avg.Fold  1   285.42 311.42
## - dat$Ins.1     1   285.62 311.62
## - dat$X5m       1   285.96 311.97
## - dat$X20m      1   286.14 312.14
## - dat$LY        1   315.08 341.08
## 
## Step:  AIC=309.16
## dat$mTORSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X1m + 
##     dat$X2m + dat$X5m + dat$X10m + dat$X20m + dat$X60m + dat$Ins.1 + 
##     dat$LY + dat$MK
## 
##                Df Deviance    AIC
## - dat$AUC       1   283.58 307.58
## - dat$X1m       1   284.04 308.04
## - dat$X2m       1   284.21 308.21
## - dat$X10m      1   284.32 308.32
## - dat$X60m      1   284.95 308.95
## <none>              283.16 309.16
## - dat$X15s      1   285.19 309.19
## - dat$Avg.Fold  1   285.55 309.55
## - dat$X5m       1   286.13 310.13
## - dat$X20m      1   286.41 310.41
## - dat$MK        1   286.45 310.45
## - dat$Ins.1     1   288.55 312.55
## - dat$LY        1   319.14 343.14
## 
## Step:  AIC=307.58
## dat$mTORSubstrate ~ dat$Avg.Fold + dat$X15s + dat$X1m + dat$X2m + 
##     dat$X5m + dat$X10m + dat$X20m + dat$X60m + dat$Ins.1 + dat$LY + 
##     dat$MK
## 
##                Df Deviance    AIC
## - dat$X1m       1   284.45 306.45
## - dat$X2m       1   284.50 306.50
## - dat$X10m      1   284.59 306.59
## <none>              283.58 307.58
## - dat$X15s      1   285.58 307.58
## - dat$X60m      1   285.59 307.59
## - dat$Avg.Fold  1   285.98 307.98
## - dat$X5m       1   286.38 308.38
## - dat$X20m      1   286.94 308.94
## - dat$MK        1   286.94 308.94
## - dat$Ins.1     1   288.72 310.72
## - dat$LY        1   319.76 341.76
## 
## Step:  AIC=306.45
## dat$mTORSubstrate ~ dat$Avg.Fold + dat$X15s + dat$X2m + dat$X5m + 
##     dat$X10m + dat$X20m + dat$X60m + dat$Ins.1 + dat$LY + dat$MK
## 
##                Df Deviance    AIC
## - dat$X2m       1   284.72 304.72
## - dat$X10m      1   284.77 304.77
## - dat$X15s      1   285.63 305.63
## - dat$X60m      1   285.65 305.65
## <none>              284.45 306.45
## - dat$X5m       1   286.45 306.45
## - dat$Avg.Fold  1   286.55 306.55
## - dat$X20m      1   287.48 307.48
## - dat$MK        1   287.91 307.91
## - dat$Ins.1     1   290.29 310.29
## - dat$LY        1   320.13 340.13
## 
## Step:  AIC=304.72
## dat$mTORSubstrate ~ dat$Avg.Fold + dat$X15s + dat$X5m + dat$X10m + 
##     dat$X20m + dat$X60m + dat$Ins.1 + dat$LY + dat$MK
## 
##                Df Deviance    AIC
## - dat$X10m      1   284.86 302.86
## - dat$X60m      1   285.65 303.65
## - dat$X15s      1   285.68 303.68
## - dat$X5m       1   286.52 304.52
## <none>              284.72 304.72
## - dat$X20m      1   287.55 305.55
## - dat$MK        1   288.11 306.11
## - dat$Avg.Fold  1   288.15 306.15
## - dat$Ins.1     1   290.34 308.34
## - dat$LY        1   320.21 338.21
## 
## Step:  AIC=302.86
## dat$mTORSubstrate ~ dat$Avg.Fold + dat$X15s + dat$X5m + dat$X20m + 
##     dat$X60m + dat$Ins.1 + dat$LY + dat$MK
## 
##                Df Deviance    AIC
## - dat$X15s      1   285.73 301.73
## - dat$X60m      1   285.77 301.77
## - dat$X5m       1   286.83 302.83
## <none>              284.86 302.86
## - dat$Avg.Fold  1   288.16 304.16
## - dat$X20m      1   288.31 304.31
## - dat$MK        1   288.91 304.91
## - dat$Ins.1     1   291.02 307.02
## - dat$LY        1   321.54 337.54
## 
## Step:  AIC=301.73
## dat$mTORSubstrate ~ dat$Avg.Fold + dat$X5m + dat$X20m + dat$X60m + 
##     dat$Ins.1 + dat$LY + dat$MK
## 
##                Df Deviance    AIC
## - dat$X60m      1   286.57 300.57
## - dat$X5m       1   286.88 300.88
## <none>              285.73 301.73
## - dat$Avg.Fold  1   288.26 302.26
## - dat$X20m      1   288.41 302.41
## - dat$MK        1   290.07 304.07
## - dat$Ins.1     1   291.40 305.40
## - dat$LY        1   322.38 336.38
## 
## Step:  AIC=300.57
## dat$mTORSubstrate ~ dat$Avg.Fold + dat$X5m + dat$X20m + dat$Ins.1 + 
##     dat$LY + dat$MK
## 
##                Df Deviance    AIC
## - dat$X5m       1   287.19 299.19
## - dat$Avg.Fold  1   288.31 300.31
## <none>              286.57 300.57
## - dat$X20m      1   291.07 303.07
## - dat$MK        1   291.31 303.31
## - dat$Ins.1     1   295.34 307.34
## - dat$LY        1   324.53 336.53
## 
## Step:  AIC=299.19
## dat$mTORSubstrate ~ dat$Avg.Fold + dat$X20m + dat$Ins.1 + dat$LY + 
##     dat$MK
## 
##                Df Deviance    AIC
## - dat$Avg.Fold  1   288.31 298.31
## <none>              287.19 299.19
## - dat$X20m      1   291.43 301.43
## - dat$MK        1   292.15 302.15
## - dat$Ins.1     1   296.11 306.11
## - dat$LY        1   324.79 334.79
## 
## Step:  AIC=298.31
## dat$mTORSubstrate ~ dat$X20m + dat$Ins.1 + dat$LY + dat$MK
## 
##             Df Deviance    AIC
## <none>           288.31 298.31
## - dat$X20m   1   292.88 300.88
## - dat$MK     1   294.08 302.08
## - dat$Ins.1  1   296.18 304.18
## - dat$LY     1   330.09 338.09
summary(glm.mtor1)
## 
## Call:
## glm(formula = dat$mTORSubstrate ~ dat$X20m + dat$Ins.1 + dat$LY + 
##     dat$MK, family = binomial, data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0801  -0.0524  -0.0400  -0.0338   3.8343  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -7.2917     0.3284 -22.204  < 2e-16 ***
## dat$X20m      0.3612     0.1727   2.091  0.03651 *  
## dat$Ins.1     0.5825     0.2140   2.722  0.00649 ** 
## dat$LY       -1.0901     0.1604  -6.797 1.07e-11 ***
## dat$MK        0.5596     0.2230   2.509  0.01211 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 371.21  on 12061  degrees of freedom
## Residual deviance: 288.31  on 12057  degrees of freedom
## AIC: 298.31
## 
## Number of Fisher Scoring iterations: 9
# interpret the model
coef(glm.mtor1)
## (Intercept)    dat$X20m   dat$Ins.1      dat$LY      dat$MK 
##  -7.2916990   0.3612087   0.5824630  -1.0900726   0.5595865
exp(glm.mtor1$coefficients) # odds ratios: multiplicative change in the odds per unit increase in each predictor
##  (Intercept)     dat$X20m    dat$Ins.1       dat$LY       dat$MK 
## 0.0006811698 1.4350628751 1.7904427850 0.3361920776 1.7499486853
exp(confint(glm.mtor1))#Confidence interval
## Waiting for profiling to be done...
##                    2.5 %      97.5 %
## (Intercept) 0.0003362352 0.001229124
## dat$X20m    1.0296699001 2.014685022
## dat$Ins.1   1.1809544351 2.689006282
## dat$LY      0.2458088900 0.462603388
## dat$MK      1.1110386882 2.636092506
xp0.5<-0/glm.mtor1$coefficients[]
xp0.5 # note: this is trivially 0 for every term; the value of a single predictor x_j (others held at 0) giving pi = 0.5 would be -(Intercept)/coef_j
## (Intercept)    dat$X20m   dat$Ins.1      dat$LY      dat$MK 
##           0           0           0           0           0
ratio05<-glm.mtor1$coefficients[]*0.25
ratio05 # "divide-by-4" rule: coefficient/4 approximates the maximum change in probability per unit change in that predictor
## (Intercept)    dat$X20m   dat$Ins.1      dat$LY      dat$MK 
## -1.82292475  0.09030217  0.14561574 -0.27251816  0.13989662

4.4.6 Evaluate Logistic Regression baseline model for mTORSubstrate class

#evaluate the model
R2cox<-1-exp((glm.mtor1$deviance-glm.mtor1$null.deviance)/length(dat$mTORSubstrate))
cat("Cox-Snell R2=",R2cox,"\n")
## Cox-Snell R2= 0.006849037
R2nag<-R2cox/(1-exp((-glm.mtor1$null.deviance)/length(dat$mTORSubstrate)))
R2nag
## [1] 0.2259933
cat("Nagelkerke R2=",R2nag,"\n")
## Nagelkerke R2= 0.2259933
plot(residuals(glm.mtor1))

influencePlot(glm.mtor1)

##          StudRes          Hat      CookD
## 636   -0.9516118 0.2045959821 0.03096525
## 3469   3.8729370 0.0001911857 0.05954741
## 4973  -1.1617307 0.1877282726 0.04506566
## 8839   3.0116993 0.0070395095 0.10194369
## 10346  3.6462645 0.0009243629 0.10866495
## 11110  3.6953068 0.0001305619 0.02274960

Summary of evaluation:

fitted.pi<-fitted(glm.mtor1)
ypred<-1*(fitted.pi>0.5)
length(ypred)
## [1] 12062
n<-table(dat$mTORSubstrate,ypred)
n
##    ypred
##         0     1
##   0 12036     0
##   1    25     1
cat("sensitivity=",n[2,2]/sum(n[2,]),"\n")
## sensitivity= 0.03846154
cat("specificity=",n[1,1]/sum(n[1,]),"\n")
## specificity= 1

At a 0.5 cutoff the stepwise model recovers only 1 of the 26 known mTOR substrates, again reflecting the severe class imbalance.

5. Adaptive Sampling

Adaptive sampling has been applied to the four models. Adaptive sampling assigns each unlabelled instance a probability of being positive (e.g. an Akt substrate) or negative (e.g. not an Akt substrate). Once these class probabilities are assigned, the unlabelled data are fed back into the model of choice as additional training data, ideally improving model performance.
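The idea can be illustrated on synthetic data. The sketch below is an assumption for exposition only (using base-R glm rather than the adaSample_SVM helper used later, with hypothetical toy variables): unlabelled instances are provisionally treated as negatives, and each iteration resamples the training set with probability proportional to the current model's confidence that each instance's label is correct.

```r
# Toy AdaSampling-style loop (illustrative sketch, not the report's adaSample_SVM).
set.seed(1)
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
truth  <- as.numeric(x1 + x2 > 0)
labels <- rep(0, n)
labels[which(truth == 1)[1:15]] <- 1   # only a few positives are "known"; the rest are unlabelled
toy <- data.frame(x1 = x1, x2 = x2, y = labels)

prob <- rep(1, n)  # initial sampling weights: all instances equally trusted
for (iter in 1:5) {
  idx <- sample(seq_len(n), size = n, replace = TRUE, prob = prob)
  fit <- suppressWarnings(glm(y ~ x1 + x2, data = toy[idx, ], family = binomial))
  p.pos <- predict(fit, newdata = toy, type = "response")
  # confidence that each instance's current label is correct
  prob <- pmax(ifelse(toy$y == 1, p.pos, 1 - p.pos), 1e-3)
}
summary(p.pos)  # class probabilities that could be fed back into a final model
```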

5.1 AdaSampling with SVM

AktSubstrate class

# create an empty container to accumulate results from all models for later comparison
result.final <- c()

dat$Identifier <- NULL
dat$Seq.Window <-NULL

# create an index to remember which rows are positively labelled as akt
aktIdx = which(dat$aktSubstrate==1) 
# create an index to remember which rows are positively labelled as mTOR
mTORIdx = which(dat$mTORSubstrate==1) 

aktSubstrate.dat <- dat[c(-16)]

# note: as.numeric(as.factor(x)) returns 1-based factor level codes, so "0" becomes 1 and "1" becomes 2
aktSubstrate.dat$aktSubstrate = as.numeric(as.factor(aktSubstrate.dat$aktSubstrate)) 

aktSubstrate.dat$aktSubstrate_corrected = as.numeric(as.factor(aktSubstrate.dat$aktSubstrate)) - 1

dim(aktSubstrate.dat)
## [1] 12062    16
data.cls.truth <- aktSubstrate.dat$aktSubstrate_corrected

aktSubstrate.dat <- aktSubstrate.dat[,-c(15:16)]

dim(aktSubstrate.dat)
## [1] 12062    14
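
The `- 1` correction above deals with a common R gotcha: `as.numeric(as.factor(x))` returns the internal level indices (1, 2, ...), not the character labels. A quick demonstration:

```r
x <- c("0", "1", "1", "0")
as.numeric(as.factor(x))                # 1 2 2 1 (level indices)
as.numeric(as.factor(x)) - 1            # 0 1 1 0 (original coding)
as.numeric(as.character(as.factor(x)))  # 0 1 1 0 (parses the labels instead)
```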
k = 10
set.seed(1)
fold=createFolds(data.cls.truth, k)
TP <- TN <- FP <- FN <- c()
for (i in 1:length(fold)){
  train.mat = aktSubstrate.dat[-fold[[i]],]
  test.mat = aktSubstrate.dat[fold[[i]],]
  cls = data.cls.truth[-fold[[i]]]
  
  # index positive and negative instances
  Ps = rownames(train.mat)[which(cls == 1)]
  Ns = rownames(train.mat)[which(cls == 0)]
  
  pred.prob = adaSample_SVM(Ps, Ns, train.mat, test.mat, classifier = "svm", cost=1, gamma=0.01)
  pred = ifelse(pred.prob[, "P"] >0.5, 1, 0)
  
  TP <- c(TP, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "1"]))
  TN <- c(TN, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "0"]))
  FP <- c(FP, sum((data.cls.truth[fold[[i]]] != pred)[pred == "1"]))
  FN <- c(FN, sum((data.cls.truth[fold[[i]]] != pred)[pred == "0"]))
}

mean(F1(cbind(TN, FP, TP, FN)))
## [1] 0.04680317
mean(Sen(cbind(TN, FP, TP, FN)))
## [1] 0.8966667
mean(Spe(cbind(TN, FP, TP, FN)))
## [1] 0.9309837
mean(Acc(cbind(TN, FP, TP, FN)))
## [1] 0.9308546
result.final <- rbind(result.final, c("AdaSampling (svm)", "Akt", 
                  mean(F1(cbind(TN, FP, TP, FN))),
                  mean(Acc(cbind(TN, FP, TP, FN))),
                  mean(Sen(cbind(TN, FP, TP, FN))),
                  mean(Spe(cbind(TN, FP, TP, FN)))))


posLikeRatio = mean(Sen(cbind(TN, FP, TP, FN)))/(1-mean(Spe(cbind(TN, FP, TP, FN))))
paste("AdaSampling (svm) for Akt - Positive Likelihood Ratio = ",posLikeRatio)
## [1] "AdaSampling (svm) for Akt - Positive Likelihood Ratio =  12.9921003349213"
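
The F1, Sen, Spe and Acc helpers used throughout this section are defined earlier in the report. Assuming the standard definitions, they behave like the following sketch, which operates on a matrix with one row per fold and columns named TN, FP, TP and FN (the shape produced by cbind(TN, FP, TP, FN)). Running it would shadow the report's own versions; it is shown only for reference:

```r
# Sketch of per-fold metric helpers (assumed definitions) for a matrix m
# with columns named "TN", "FP", "TP", "FN", one row per fold.
Sen <- function(m) m[, "TP"] / (m[, "TP"] + m[, "FN"])   # sensitivity / recall
Spe <- function(m) m[, "TN"] / (m[, "TN"] + m[, "FP"])   # specificity
Acc <- function(m) (m[, "TP"] + m[, "TN"]) / rowSums(m)  # accuracy
F1  <- function(m) {                                     # harmonic mean of
  prec <- m[, "TP"] / (m[, "TP"] + m[, "FP"])            # precision and recall
  2 * prec * Sen(m) / (prec + Sen(m))
}
```

The positive likelihood ratio reported after each model is then simply Sen/(1 - Spe): how many times more likely a positive prediction is for a true substrate than for a background site.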

mTORSubstrate class

mTORSubstrate.dat <- dat[c(-15)] # drop the aktSubstrate column

# Note: as.numeric(as.factor(x)) maps the levels "0"/"1" to 1/2, not 0/1
mTORSubstrate.dat$mTORSubstrate = as.numeric(as.factor(mTORSubstrate.dat$mTORSubstrate))

# subtract 1 to recover the original 0/1 coding
mTORSubstrate.dat$mTORSubstrate_corrected = as.numeric(as.factor(mTORSubstrate.dat$mTORSubstrate)) - 1

dim(mTORSubstrate.dat)
## [1] 12062    16
data.cls.truth <- mTORSubstrate.dat$mTORSubstrate_corrected

mTORSubstrate.dat <- mTORSubstrate.dat[,-c(15:16)]
k = 10
set.seed(1)


fold=createFolds(data.cls.truth, k)
TP <- TN <- FP <- FN <- c()
for (i in 1:length(fold)){
  train.mat = mTORSubstrate.dat[-fold[[i]],]
  test.mat = mTORSubstrate.dat[fold[[i]],]
  cls = data.cls.truth[-fold[[i]]]
  
  # index positive and negative instances
  Ps = rownames(train.mat)[which(cls == 1)]
  Ns = rownames(train.mat)[which(cls == 0)]
  
  pred.prob = adaSample_SVM(Ps, Ns, train.mat, test.mat, classifier = "svm", cost=1, gamma=0.01)
  pred = ifelse(pred.prob[, "P"] >0.5, 1, 0)
  
  TP <- c(TP, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "1"]))
  TN <- c(TN, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "0"]))
  FP <- c(FP, sum((data.cls.truth[fold[[i]]] != pred)[pred == "1"]))
  FN <- c(FN, sum((data.cls.truth[fold[[i]]] != pred)[pred == "0"]))
}

mean(F1(cbind(TN, FP, TP, FN)))
## [1] 0.02001592
mean(Sen(cbind(TN, FP, TP, FN)))
## [1] 0.855
mean(Spe(cbind(TN, FP, TP, FN)))
## [1] 0.8148946
mean(Acc(cbind(TN, FP, TP, FN)))
## [1] 0.8149533
# add to result.final
result.final <- rbind(result.final,c("AdaSampling (svm)", "mToR", 
                  mean(F1(cbind(TN, FP, TP, FN))),
                  mean(Acc(cbind(TN, FP, TP, FN))),
                  mean(Sen(cbind(TN, FP, TP, FN))),
                  mean(Spe(cbind(TN, FP, TP, FN)))))

posLikeRatio = mean(Sen(cbind(TN, FP, TP, FN)))/(1-mean(Spe(cbind(TN, FP, TP, FN))))
paste("AdaSampling (svm) for mTOR - Positive Likelihood Ratio = ",posLikeRatio)
## [1] "AdaSampling (svm) for mTOR - Positive Likelihood Ratio =  4.61898938993955"

5.2 AdaSampling with KNN

AktSubstrate class

dat$Identifier <- NULL
dat$Seq.Window <-NULL

# create an index to remember which rows are positively labelled as akt
aktIdx = which(dat$aktSubstrate==1) 
# create an index to remember which rows are positively labelled as mTOR
mTORIdx = which(dat$mTORSubstrate==1) 

aktSubstrate.dat <- dat[c(-16)] # drop the mTORSubstrate column
# Note: as.numeric(as.factor(x)) maps the levels "0"/"1" to 1/2, not 0/1
aktSubstrate.dat$aktSubstrate = as.numeric(as.factor(aktSubstrate.dat$aktSubstrate)) 

# subtract 1 to recover the original 0/1 coding
aktSubstrate.dat$aktSubstrate_corrected = as.numeric(as.factor(aktSubstrate.dat$aktSubstrate)) - 1


dim(aktSubstrate.dat)
## [1] 12062    16
aktSubstrate.dat[1:2,]
##   Avg.Fold       AUC       X15s      X30s      X1m      X2m      X5m
## 1 1.031512 0.2967288 -0.1858515 0.5813547 1.888572 2.101900 1.842677
## 2 1.216053 0.5279128  0.1077492 0.2802415 1.247199 2.027167 2.637666
##       X10m     X20m        X60m     Ins.1         LY    Ins.2       MK
## 1 2.287874 1.343137 -1.60756467 0.6526217 0.34193177 2.157001 2.137921
## 2 2.374491 1.103292 -0.04938089 0.6112755 0.04620434 3.096346 3.226832
##   aktSubstrate aktSubstrate_corrected
## 1            1                      0
## 2            1                      0
data.cls.truth <- aktSubstrate.dat$aktSubstrate_corrected

aktSubstrate.dat <- aktSubstrate.dat[,-c(15:16)]
k = 10
set.seed(1)
fold=createFolds(data.cls.truth, k)
TP <- TN <- FP <- FN <- c()
for (i in 1:length(fold)){
  train.mat = aktSubstrate.dat[-fold[[i]],]
  test.mat = aktSubstrate.dat[fold[[i]],]
  cls = data.cls.truth[-fold[[i]]]
  
  # index positive and negative instances
  Ps = rownames(train.mat)[which(cls == 1)]
  Ns = rownames(train.mat)[which(cls == 0)]
  
  pred.prob = adaSample(Ps, Ns, train.mat, test.mat, classifier = "knn")
  pred = ifelse(pred.prob[, "P"] >0.5, 1, 0)
  
  TP <- c(TP, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "1"]))
  TN <- c(TN, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "0"]))
  FP <- c(FP, sum((data.cls.truth[fold[[i]]] != pred)[pred == "1"]))
  FN <- c(FN, sum((data.cls.truth[fold[[i]]] != pred)[pred == "0"]))
}

mean(F1(cbind(TN, FP, TP, FN)))
## [1] 0.03898407
mean(Sen(cbind(TN, FP, TP, FN)))
## [1] 0.8966667
mean(Spe(cbind(TN, FP, TP, FN)))
## [1] 0.9203549
mean(Acc(cbind(TN, FP, TP, FN)))
## [1] 0.9202451
result.final <- rbind(result.final, c("AdaSampling (knn)", "Akt", 
                  mean(F1(cbind(TN, FP, TP, FN))),
                  mean(Acc(cbind(TN, FP, TP, FN))),
                  mean(Sen(cbind(TN, FP, TP, FN))),
                  mean(Spe(cbind(TN, FP, TP, FN)))))


posLikeRatio = mean(Sen(cbind(TN, FP, TP, FN)))/(1-mean(Spe(cbind(TN, FP, TP, FN))))
paste("AdaSampling (knn) for Akt - Positive Likelihood Ratio = ",posLikeRatio)
## [1] "AdaSampling (knn) for Akt - Positive Likelihood Ratio =  11.2582823268712"

mTORSubstrate class

mTORSubstrate.dat <- dat[c(-15)] # drop the aktSubstrate column

# Note: as.numeric(as.factor(x)) maps the levels "0"/"1" to 1/2, not 0/1
mTORSubstrate.dat$mTORSubstrate = as.numeric(as.factor(mTORSubstrate.dat$mTORSubstrate))


# subtract 1 to recover the original 0/1 coding
mTORSubstrate.dat$mTORSubstrate_corrected = as.numeric(as.factor(mTORSubstrate.dat$mTORSubstrate)) - 1

data.cls.truth <- mTORSubstrate.dat$mTORSubstrate_corrected
mTORSubstrate.dat <- mTORSubstrate.dat[,-c(15:16)]

dim(mTORSubstrate.dat)
## [1] 12062    14
k = 10
set.seed(1)
fold=createFolds(data.cls.truth, k)
TP <- TN <- FP <- FN <- c()
for (i in 1:length(fold)){
  train.mat = mTORSubstrate.dat[-fold[[i]],]
  test.mat = mTORSubstrate.dat[fold[[i]],]
  cls = data.cls.truth[-fold[[i]]]
  
  # index positive and negative instances
  Ps = rownames(train.mat)[which(cls == 1)]
  Ns = rownames(train.mat)[which(cls == 0)]
  
  pred.prob = adaSample(Ps, Ns, train.mat, test.mat, classifier = "knn")
  pred = ifelse(pred.prob[, "P"] >0.5, 1, 0)
  
  TP <- c(TP, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "1"]))
  TN <- c(TN, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "0"]))
  FP <- c(FP, sum((data.cls.truth[fold[[i]]] != pred)[pred == "1"]))
  FN <- c(FN, sum((data.cls.truth[fold[[i]]] != pred)[pred == "0"]))
}

mean(F1(cbind(TN, FP, TP, FN)))
## [1] 0.02254541
mean(Sen(cbind(TN, FP, TP, FN)))
## [1] 0.815
mean(Spe(cbind(TN, FP, TP, FN)))
## [1] 0.8487982
mean(Acc(cbind(TN, FP, TP, FN)))
## [1] 0.8486133
# add to result.final
result.final <- rbind(result.final,c("AdaSampling (knn)", "mToR", 
                  mean(F1(cbind(TN, FP, TP, FN))),
                  mean(Acc(cbind(TN, FP, TP, FN))),
                  mean(Sen(cbind(TN, FP, TP, FN))),
                  mean(Spe(cbind(TN, FP, TP, FN)))))

posLikeRatio = mean(Sen(cbind(TN, FP, TP, FN)))/(1-mean(Spe(cbind(TN, FP, TP, FN))))
paste("AdaSampling (knn) for mTOR - Positive Likelihood Ratio = ",posLikeRatio)
## [1] "AdaSampling (knn) for mTOR - Positive Likelihood Ratio =  5.39014815946263"

5.3 AdaSampling with Logistic Regression

AktSubstrate class

# create an index to remember which rows are positively labelled as akt
aktIdx = which(dat$aktSubstrate==1) 
# create an index to remember which rows are positively labelled as mTOR
mTORIdx = which(dat$mTORSubstrate==1) 

aktSubstrate.dat <- dat[c(-16)] # drop the mTORSubstrate column

# Note: as.numeric(as.factor(x)) maps the levels "0"/"1" to 1/2, not 0/1
aktSubstrate.dat$aktSubstrate = as.numeric(as.factor(aktSubstrate.dat$aktSubstrate)) 

# subtract 1 to recover the original 0/1 coding
aktSubstrate.dat$aktSubstrate_corrected = as.numeric(as.factor(aktSubstrate.dat$aktSubstrate)) - 1


dim(aktSubstrate.dat)
## [1] 12062    16
aktSubstrate.dat[1:2,]
##   Avg.Fold       AUC       X15s      X30s      X1m      X2m      X5m
## 1 1.031512 0.2967288 -0.1858515 0.5813547 1.888572 2.101900 1.842677
## 2 1.216053 0.5279128  0.1077492 0.2802415 1.247199 2.027167 2.637666
##       X10m     X20m        X60m     Ins.1         LY    Ins.2       MK
## 1 2.287874 1.343137 -1.60756467 0.6526217 0.34193177 2.157001 2.137921
## 2 2.374491 1.103292 -0.04938089 0.6112755 0.04620434 3.096346 3.226832
##   aktSubstrate aktSubstrate_corrected
## 1            1                      0
## 2            1                      0
data.cls.truth <- aktSubstrate.dat$aktSubstrate_corrected
aktSubstrate.dat <- aktSubstrate.dat[,-c(15:16)]
k = 10
set.seed(1)
fold=createFolds(data.cls.truth, k)
TP <- TN <- FP <- FN <- c()
for (i in 1:length(fold)){
  train.mat = aktSubstrate.dat[-fold[[i]],-13] # drop column 13 (Ins.2) for the logistic fit
  test.mat = aktSubstrate.dat[fold[[i]],-13]
  cls = data.cls.truth[-fold[[i]]]
  
  # index positive and negative instances
  Ps = rownames(train.mat)[which(cls == 1)]
  Ns = rownames(train.mat)[which(cls == 0)]
  
  pred.prob = adaSample(Ps, Ns, train.mat, test.mat, classifier = "logit")
  pred = ifelse(pred.prob[, "P"] >0.5, 1, 0)
  
  TP <- c(TP, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "1"]))
  TN <- c(TN, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "0"]))
  FP <- c(FP, sum((data.cls.truth[fold[[i]]] != pred)[pred == "1"]))
  FN <- c(FN, sum((data.cls.truth[fold[[i]]] != pred)[pred == "0"]))
}
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## (the two warnings above repeat across the AdaSampling iterations in every
## fold; repeats omitted. They signal quasi-complete separation in some
## resampled training sets, which is expected with so few positive labels.)
mean(F1(cbind(TN, FP, TP, FN)))
## [1] 0.03257058
mean(Sen(cbind(TN, FP, TP, FN)))
## [1] 0.955
mean(Spe(cbind(TN, FP, TP, FN)))
## [1] 0.8766783
mean(Acc(cbind(TN, FP, TP, FN)))
## [1] 0.876732
result.final <- rbind(result.final, c("AdaSampling (logit)", "Akt", 
                  mean(F1(cbind(TN, FP, TP, FN))),
                  mean(Acc(cbind(TN, FP, TP, FN))),
                  mean(Sen(cbind(TN, FP, TP, FN))),
                  mean(Spe(cbind(TN, FP, TP, FN)))))


posLikeRatio = mean(Sen(cbind(TN, FP, TP, FN)))/(1-mean(Spe(cbind(TN, FP, TP, FN))))
paste("AdaSampling (logit) for Akt - Positive Likelihood Ratio = ",posLikeRatio)
## [1] "AdaSampling (logit) for Akt - Positive Likelihood Ratio =  7.74397143509124"

mTORSubstrate class

mTORSubstrate.dat <- dat[c(-15)] # drop the aktSubstrate column

# Note: as.numeric(as.factor(x)) maps the levels "0"/"1" to 1/2, not 0/1
mTORSubstrate.dat$mTORSubstrate = as.numeric(as.factor(mTORSubstrate.dat$mTORSubstrate))

# subtract 1 to recover the original 0/1 coding
mTORSubstrate.dat$mTORSubstrate_corrected = as.numeric(as.factor(mTORSubstrate.dat$mTORSubstrate)) - 1
data.cls.truth <- mTORSubstrate.dat$mTORSubstrate_corrected
mTORSubstrate.dat <- mTORSubstrate.dat[,-c(15:16)]

dim(mTORSubstrate.dat)
## [1] 12062    14
k = 10
set.seed(1)
fold=createFolds(data.cls.truth, k)
TP <- TN <- FP <- FN <- c()
for (i in 1:length(fold)){
  train.mat = mTORSubstrate.dat[-fold[[i]],-13] # drop column 13 (Ins.2) for the logistic fit
  test.mat = mTORSubstrate.dat[fold[[i]],-13]
  cls = data.cls.truth[-fold[[i]]]
  
  # index positive and negative instances
  Ps = rownames(train.mat)[which(cls == 1)]
  Ns = rownames(train.mat)[which(cls == 0)]
  
  pred.prob = adaSample(Ps, Ns, train.mat, test.mat, classifier = "logit")
  pred = ifelse(pred.prob[, "P"] >0.5, 1, 0)
  
  TP <- c(TP, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "1"]))
  TN <- c(TN, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "0"]))
  FP <- c(FP, sum((data.cls.truth[fold[[i]]] != pred)[pred == "1"]))
  FN <- c(FN, sum((data.cls.truth[fold[[i]]] != pred)[pred == "0"]))
}
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## (the two warnings above repeat across the AdaSampling iterations in every
## fold; repeats omitted. They signal quasi-complete separation in some
## resampled training sets, which is expected with so few positive labels.)
mean(F1(cbind(TN, FP, TP, FN)))
## [1] 0.01502278
mean(Sen(cbind(TN, FP, TP, FN)))
## [1] 0.8016667
mean(Spe(cbind(TN, FP, TP, FN)))
## [1] 0.7814113
mean(Acc(cbind(TN, FP, TP, FN)))
## [1] 0.7813835
# add to result.final
result.final <- rbind(result.final,c("AdaSampling (logit)", "mToR", 
                  mean(F1(cbind(TN, FP, TP, FN))),
                  mean(Acc(cbind(TN, FP, TP, FN))),
                  mean(Sen(cbind(TN, FP, TP, FN))),
                  mean(Spe(cbind(TN, FP, TP, FN)))))

posLikeRatio = mean(Sen(cbind(TN, FP, TP, FN)))/(1-mean(Spe(cbind(TN, FP, TP, FN))))
paste("AdaSampling (logit) for mTOR - Positive Likelihood Ratio = ",posLikeRatio)
## [1] "AdaSampling (logit) for mTOR - Positive Likelihood Ratio =  3.66746583220171"

5.4 AdaSampling with Random Forest

Prepare the dataset for AdaSampling with Random Forest (the Akt and mTOR tasks share this feature matrix but use their own label vectors)

dat.extra.ada <- dat.extra %>% select(-Type) # Remove the Type feature (Seq.Window and Identifier are handled below).

# Explicitly store the Identifier into a separate matrix to reference back later.
rowNameToIdentifierMap <- data.frame(rownames(dat.extra.ada), dat.extra.ada$Identifier, dat.extra.ada$Seq.Window)

dat.extra.ada$Seq.Window <- NULL # Drop the Seq.Window factor (retained in the map above)

names(rowNameToIdentifierMap) <- c("rowName", "Identifier", "Seq.Window")

dat.extra.ada$Identifier <- NULL # Remove Identifier column (each identifier is unique and not useful for classification)

# Save the positive (Akt) and negative (mTOR / unknown) classes in separate vectors for use in adaSample. Positive and negative instances are indexed below.

dat.extra.ada.akt_cls <- dat.extra.ada$aktSubstrate
dat.extra.ada.mtor_cls <- dat.extra.ada$mTORSubstrate

# Drop the labels from the dat.extra.ada dataframe
dat.extra.ada$aktSubstrate <-NULL
dat.extra.ada$mTORSubstrate <-NULL

colnames(dat.extra.ada)
##  [1] "Avg.Fold" "AUC"      "X15s"     "X30s"     "X1m"      "X2m"     
##  [7] "X5m"      "X10m"     "X20m"     "X60m"     "Ins.1"    "LY"      
## [13] "Ins.2"    "MK"       "diffP1"   "diffP2"   "diffP3"   "diffP4"  
## [19] "diffP5"   "diffP6"   "diffP7"   "s1"       "s2"       "s3"      
## [25] "s4"       "s5"       "s6"       "s7"       "s8"       "s9"      
## [31] "s10"      "s11"      "s12"      "s13"

Akt Substrate


# Index positive and negative instances; this assumes all unlabeled samples belong to the negative class
Ps <- rownames(dat.extra.ada)[which(dat.extra.ada.akt_cls == 1)]
Ns <- rownames(dat.extra.ada)[which(dat.extra.ada.akt_cls == 0)]

## Run AdaSampling over the full dataset (train and test are the same here) to score every site
# require(AdaSampling)
pred.prob <- adaSample(Ps, Ns, dat.extra.ada, test=dat.extra.ada, classifier="rf", C=50, mtry=25, ntree=800)

dim(pred.prob)
## [1] 12062     2
head(pred.prob[,])
##          N        P
## 1 0.156625 0.843375
## 2 0.302300 0.697700
## 3 0.981075 0.018925
## 4 0.984100 0.015900
## 5 0.993950 0.006050
## 6 0.988075 0.011925
sum(dat.extra.ada.akt_cls==1)
## [1] 22
# summary of predictions
pred = ifelse(pred.prob[,"P"] > 0.5, 1, 0)

TP <- sum((dat.extra.ada.akt_cls == pred)[dat.extra.ada.akt_cls == "1"])
TN <- sum((dat.extra.ada.akt_cls == pred)[dat.extra.ada.akt_cls == "0"])
FP <- sum((dat.extra.ada.akt_cls != pred)[pred == "1"])
FN <- sum((dat.extra.ada.akt_cls != pred)[pred == "0"])

# add to result.final
result.final <- rbind(result.final,c("AdaSampling (RF)", "Akt", 
                  F1(cbind(TN, FP, TP, FN)),
                  Acc(cbind(TN, FP, TP, FN)),
                  Sen(cbind(TN, FP, TP, FN)),
                  Spe(cbind(TN, FP, TP, FN))))



posLikeRatio = Sen(cbind(TN, FP, TP, FN))/(1-Spe(cbind(TN, FP, TP, FN)))
paste("AdaSampling (RF) for Akt - Positive Likelihood Ratio = ",posLikeRatio)
## [1] "AdaSampling (RF) for Akt - Positive Likelihood Ratio =  15.6576665840971"

mTOR predictions

Ps <- rownames(dat.extra.ada)[which(dat.extra.ada.mtor_cls == 1)]
Ns <- rownames(dat.extra.ada)[which(dat.extra.ada.mtor_cls == 0)]

## Run AdaSampling over the full dataset (train and test are the same here) to score every site
# require(AdaSampling)
pred.prob.mTOR <- adaSample(Ps, Ns, dat.extra.ada, test=dat.extra.ada, classifier="rf", C=50, mtry=32, ntree=100)

dim(pred.prob.mTOR)
## [1] 12062     2
head(pred.prob.mTOR)
##        N      P
## 1 0.4622 0.5378
## 2 0.4874 0.5126
## 3 0.9012 0.0988
## 4 0.9442 0.0558
## 5 0.9514 0.0486
## 6 0.9406 0.0594
# summary of predictions
pred = ifelse(pred.prob.mTOR[,"P"] > 0.5, 1, 0)

TP <- sum((dat.extra.ada.mtor_cls == pred)[dat.extra.ada.mtor_cls == "1"])
TN <- sum((dat.extra.ada.mtor_cls == pred)[dat.extra.ada.mtor_cls == "0"])
FP <- sum((dat.extra.ada.mtor_cls != pred)[pred == "1"])
FN <- sum((dat.extra.ada.mtor_cls != pred)[pred == "0"])

# add to result.final
result.final <- rbind(result.final,c("AdaSampling (RF)", "mToR", 
                  F1(cbind(TN, FP, TP, FN)),
                  Acc(cbind(TN, FP, TP, FN)),
                  Sen(cbind(TN, FP, TP, FN)),
                  Spe(cbind(TN, FP, TP, FN))))


sum(dat.extra.ada.mtor_cls==1)
## [1] 26
head(dat.extra.ada.mtor_cls)
## [1] 0 0 0 0 0 0
## Levels: 0 1

6. Validation and benchmark of prediction results

Compare predictions to those of the 2016 study. The best cost and gamma values identified earlier are used to generate a final AdaSampling run over the whole dataset, with an SVM as the base classifier inside the AdaSampling method.
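
Inside AdaSampling, a base SVM with those settings corresponds to a call like the following (a self-contained sketch using e1071, which the adaSample_SVM wrapper presumably builds on; the toy data here is purely illustrative):

```r
library(e1071)

set.seed(1)
# toy stand-in for the phosphoproteomics feature matrix
X <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
y <- factor(as.numeric(X$x1 + X$x2 > 0))

# radial-kernel SVM with the cost/gamma used in Section 5.1;
# probability = TRUE enables Platt-scaled class probabilities
fit <- svm(x = X, y = y, kernel = "radial", cost = 1, gamma = 0.01,
           probability = TRUE)
pred <- predict(fit, X, probability = TRUE)
prob <- attr(pred, "probabilities")  # per-class probability matrix
```

The "probabilities" attribute is what a wrapper can threshold (e.g. P > 0.5) or rank to nominate candidate substrates.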

6.1 Comparing the AdaSampling methods for both Akt and mTOR predictions

# write the results into csv

result.final <- data.frame(result.final)
names(result.final) <- c("Method", "Substrate","F1","Accuracy","Sensitivity", "Specificity")

# rbind stored everything as factors; convert the metric columns back to numeric
result.final <- transform(result.final, F1 = as.numeric(levels(F1))[F1],
                                        Accuracy = as.numeric(levels(Accuracy))[Accuracy],
                                        Sensitivity = as.numeric(levels(Sensitivity))[Sensitivity],
                                        Specificity = as.numeric(levels(Specificity))[Specificity])


write.csv(result.final, './Output/result.final.csv')

# print the results for comparison
result.final
##                Method Substrate         F1  Accuracy Sensitivity
## 1   AdaSampling (svm)       Akt 0.04680317 0.9308546   0.8966667
## 2   AdaSampling (svm)      mToR 0.02001592 0.8149533   0.8550000
## 3   AdaSampling (knn)       Akt 0.03898407 0.9202451   0.8966667
## 4   AdaSampling (knn)      mToR 0.02254541 0.8486133   0.8150000
## 5 AdaSampling (logit)       Akt 0.03257058 0.8767320   0.9550000
## 6 AdaSampling (logit)      mToR 0.01502278 0.7813835   0.8016667
## 7    AdaSampling (RF)       Akt 0.05405405 0.9390648   0.9545455
## 8    AdaSampling (RF)      mToR 0.02135231 0.8176090   0.9230769
##   Specificity
## 1   0.9309837
## 2   0.8148946
## 3   0.9203549
## 4   0.8487982
## 5   0.8766783
## 6   0.7814113
## 7   0.9390365
## 8   0.8173812

6.2 Generate final predictions on the unlabeled dataset using the best identified classifier from prior experiments

AktFinalPred <- data.frame(pred.prob, dat.extra.ada.akt_cls) %>%
  merge(rowNameToIdentifierMap, by.x = "row.names", by.y = "rowName") %>%
  arrange(desc(P)) %>%
  filter(dat.extra.ada.akt_cls == 0)

mTORFinalPred <- data.frame(pred.prob.mTOR, dat.extra.ada.mtor_cls) %>%
  merge(rowNameToIdentifierMap, by.x = "row.names", by.y = "rowName") %>%
  arrange(desc(P)) %>%
  filter(dat.extra.ada.mtor_cls == 0)

head(AktFinalPred %>% arrange(Row.names))
##   Row.names        N        P dat.extra.ada.akt_cls         Identifier
## 1         1 0.156625 0.843375                     0      100043914;88;
## 2        10 0.920175 0.079825                     0 1700009N14RIK;135;
## 3       100 0.967825 0.032175                     0         A2MR;4525;
## 4      1000 0.859850 0.140150                     0      ARHGAP17;575;
## 5     10000 0.960100 0.039900                     0          RBM15;20;
## 6     10001 0.967150 0.032850                     0         RBM15;211;
##      Seq.Window
## 1 GRHERRSSAEQHS
## 2 RKVKAKSIVFHRK
## 3 RHSLASTDEKREL
## 4 LSPGDSSPPKPKD
## 5 PRWRRASPLCETS
## 6 HLSGSGSGDERVA
head(mTORFinalPred %>% arrange(Row.names))
##   Row.names      N      P dat.extra.ada.mtor_cls         Identifier
## 1         1 0.4622 0.5378                      0      100043914;88;
## 2        10 0.9364 0.0636                      0 1700009N14RIK;135;
## 3       100 0.7824 0.2176                      0         A2MR;4525;
## 4      1000 0.9340 0.0660                      0      ARHGAP17;575;
## 5     10000 0.9640 0.0360                      0          RBM15;20;
## 6     10001 0.9450 0.0550                      0         RBM15;211;
##      Seq.Window
## 1 GRHERRSSAEQHS
## 2 RKVKAKSIVFHRK
## 3 RHSLASTDEKREL
## 4 LSPGDSSPPKPKD
## 5 PRWRRASPLCETS
## 6 HLSGSGSGDERVA
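The merges above use `merge()`'s special handling of `by.x = "row.names"`: matching is done on the data frame's row names, which appear in the result as a `Row.names` column. A toy sketch with hypothetical values:

```r
# Two toy probabilities stored under row names "s1" and "s2"
scores.toy <- data.frame(P = c(0.9, 0.2), row.names = c("s1", "s2"))
map.toy    <- data.frame(rowName    = c("s1", "s2"),
                         Identifier = c("AKT1;10;", "RBM15;20;"))

# Row names of x are matched against the rowName column of y
m.toy <- merge(scores.toy, map.toy, by.x = "row.names", by.y = "rowName")
m.toy  # columns: Row.names, P, Identifier
```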
# Combine the results
final.pred.combined <- AktFinalPred %>%
  rename(AktProb = P) %>%
  select(Row.names, AktProb) %>%
  merge(mTORFinalPred %>%
          rename(mTORProb = P) %>%
          select(Row.names, mTORProb, Identifier, Seq.Window),
        by = "Row.names") %>%
  mutate(AktPred  = ifelse(AktProb > mTORProb & AktProb > 0.5, 1, 0),
         mTORPred = ifelse(mTORProb > AktProb & mTORProb > 0.5, 1, 0))

head(final.pred.combined)
##   Row.names  AktProb mTORProb         Identifier    Seq.Window AktPred
## 1         1 0.843375   0.5378      100043914;88; GRHERRSSAEQHS       1
## 2        10 0.079825   0.0636 1700009N14RIK;135; RKVKAKSIVFHRK       0
## 3       100 0.032175   0.2176         A2MR;4525; RHSLASTDEKREL       0
## 4      1000 0.140150   0.0660      ARHGAP17;575; LSPGDSSPPKPKD       0
## 5     10000 0.039900   0.0360          RBM15;20; PRWRRASPLCETS       0
## 6     10001 0.032850   0.0550         RBM15;211; HLSGSGSGDERVA       0
##   mTORPred
## 1        0
## 2        0
## 3        0
## 4        0
## 5        0
## 6        0
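The arbitration rule applied above (a site is called for whichever kinase has the higher probability, and only when that probability exceeds 0.5) can be checked on a toy data frame with hypothetical probabilities:

```r
toy <- data.frame(AktProb  = c(0.84, 0.45, 0.30),
                  mTORProb = c(0.54, 0.60, 0.40))

# A kinase is called only if its probability is both the larger of the
# two and greater than 0.5
toy$AktPred  <- ifelse(toy$AktProb > toy$mTORProb & toy$AktProb > 0.5, 1, 0)
toy$mTORPred <- ifelse(toy$mTORProb > toy$AktProb & toy$mTORProb > 0.5, 1, 0)

toy  # row 1: Akt call; row 2: mTOR call; row 3: no call
```

Note that a tie above 0.5 yields neither call, since both comparisons use strict inequality.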

6.3 Compare the model predictions to the predictions of the 2016 academic paper

Prediction2016.akt <- readxl::read_excel("./Data2/Prediction_2016.xlsx", sheet="Akt_prediction")
Prediction2016.akt <- Prediction2016.akt %>%
  mutate(Identifier = paste(stringr::str_to_upper(GeneSymbol), ";", `Phosphorylation site`, ";", sep = ""))

head(Prediction2016.akt)
## # A tibble: 6 x 11
##   GeneSymbol `Phosphorylatio~ `Sequence windo~ `Uniprot ID` `IPI ID`
##   <chr>                 <dbl> <chr>            <chr>        <chr>   
## 1 Irs2                    303 FRPRSKSQSSGSS    B9EJW3       IPI0092~
## 2 Cep131                  114 GRRRVASLSKASS    B1AXI9       IPI0088~
## 3 Ndrg3                   331 ARSRTHSTSSSIG    Q9QYF9       IPI0013~
## 4 Flnc                   2234 GRERLGSFGSITR    Q8VHX6-1     IPI0066~
## 5 Cables1                 272 RRCRTLSGSPRPK    B2RWU9       IPI0092~
## 6 Rtkn                    520 GRPRTFSLDAAPA    Q8C6B2-1     IPI0022~
## # ... with 6 more variables: `Full model predict` <dbl>, `Motif
## #   predict` <dbl>, `Phosphoproteome predict` <dbl>, Delta <dbl>, `Uniprot
## #   database` <chr>, Identifier <chr>
Prediction2016.mtor <- readxl::read_excel("./Data2/Prediction_2016.xlsx", sheet="mTOR_prediction")
Prediction2016.mtor <- Prediction2016.mtor %>%
  mutate(Identifier = paste(stringr::str_to_upper(GeneSymbol), ";", `Phosphorylation site`, ";", sep = ""))

head(Prediction2016.mtor)
## # A tibble: 6 x 11
##   GeneSymbol `Phosphorylatio~ `Sequence windo~ `Uniprot ID` `IPI ID`
##   <chr>                 <dbl> <chr>            <chr>        <chr>   
## 1 Ulk2                    772 GRVCVGSPPGPGL    Q9QY01       IPI0033~
## 2 C2cd5                   295 NQTYSFSPSKSYS    Q7TPS5-1     IPI0040~
## 3 Ulk2                    781 GPGLGSSPPGAEG    Q9QY01       IPI0033~
## 4 Stk11ip                 444 LETMGSSPLSTTK    Q3TAA7       IPI0046~
## 5 Wdr91                   257 PASLSQSPRVGFL    Q7TMQ7       IPI0033~
## 6 Oxr1                     62 RPPSPASPEGPDT    Q8C715       IPI0065~
## # ... with 6 more variables: `Full model predict` <dbl>, `Motif
## #   predict` <dbl>, `Phosphoproteome predict` <dbl>, Delta <dbl>, `Uniprot
## #   database` <chr>, Identifier <chr>
dim(Prediction2016.akt)
## [1] 12267    11
dim(Prediction2016.mtor)
## [1] 12263    11
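The `Identifier` key is constructed as `GENESYMBOL;site;` so that the 2016 predictions can be joined against our identifiers. A minimal sketch using base `toupper()` in place of `stringr::str_to_upper()`:

```r
gene.toy <- "Irs2"
site.toy <- 303

# paste() coerces the numeric site to character and glues the parts together
id.toy <- paste(toupper(gene.toy), ";", site.toy, ";", sep = "")
id.toy  # "IRS2;303;"
```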

6.4 Benchmark of prediction results

6.4.1 Sensitivity

p.Sen <- ggplot(result.final) +
  geom_col(aes(x=Method, y=Sensitivity, color=Method, fill=Method)) +
  geom_text(data=result.final,aes(x=Method,y=Sensitivity, label=as.character(round(Sensitivity,2))),vjust=1.2, size=3) +
  xlab("Method") +
  ylab("Avg. Sensitivity over 10 Folds") +
  scale_y_continuous(labels = scales::percent) +
  facet_wrap(~Substrate) +
  ggtitle("AdaSample Classifiers - Sensitivity ") + 
  theme_light() +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

p.Sen

The sensitivity plot above shows the average sensitivity of the AdaSampled classifiers, estimated over 10 folds of cross-validation. From the column chart we can see that the logistic regression base classifier achieves the highest sensitivity for Akt substrate prediction (closely followed by random forest), while random forest achieves the highest sensitivity for mTOR. Note: this sensitivity measure is based only on the known Akt and mTOR labels; potential unlabeled positives are unaccounted for in the statistics above.

6.4.2 Specificity (lower bound)

p.Spe <- ggplot(result.final) +
  geom_col(aes(x=Method, y=Specificity, color=Method, fill=Method)) +
  geom_text(data=result.final,aes(x=Method,y=Specificity, label=as.character(round(Specificity,2))),vjust=1.2, size=3) +
  xlab("Method") +
  ylab("Avg. Specificity over 10 Folds") +
  scale_y_continuous(labels = scales::percent) +
  facet_wrap(~Substrate) +
  ggtitle("AdaSample Classifiers - Specificity (lower bd) ") + 
  theme_light() +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

p.Spe

The specificity plot above shows the average specificity of the AdaSampled classifiers, estimated over 10 folds of cross-validation. From the column chart we can see that random forest achieves the highest specificity for Akt substrate prediction, narrowly ahead of SVM, while k-NN achieves the highest specificity for mTOR. Note: this specificity measure assumes that all unlabeled samples belong to the negative class for both Akt and mTOR. As a result, the specificity reported here is a 'lower bound', as there may be positive samples in the unknown set.

6.5 Generate required outputs

  1. The probability of each sample being mislabelled; and
  2. A predicted list of mislabelled samples from the training and test datasets, respectively.
head(final.pred.combined)
##   Row.names  AktProb mTORProb         Identifier    Seq.Window AktPred
## 1         1 0.843375   0.5378      100043914;88; GRHERRSSAEQHS       1
## 2        10 0.079825   0.0636 1700009N14RIK;135; RKVKAKSIVFHRK       0
## 3       100 0.032175   0.2176         A2MR;4525; RHSLASTDEKREL       0
## 4      1000 0.140150   0.0660      ARHGAP17;575; LSPGDSSPPKPKD       0
## 5     10000 0.039900   0.0360          RBM15;20; PRWRRASPLCETS       0
## 6     10001 0.032850   0.0550         RBM15;211; HLSGSGSGDERVA       0
##   mTORPred
## 1        0
## 2        0
## 3        0
## 4        0
## 5        0
## 6        0
# (1) The probability of each sample being mislabelled; and
write.csv(final.pred.combined, "./Output/Akt_mTOR_Prediction_Probabilities.csv")

# (2) A predicted list of mislabelled samples from the training and test datasets, respectively.
## (2a - Akt)
write.csv(final.pred.combined %>% filter(AktPred == 1)%>% select(Identifier,Seq.Window, AktProb, AktPred), "./Output/Akt_Prediction_List.csv")

## (2b - mTOR)
write.csv(final.pred.combined %>% filter(mTORPred == 1) %>% select(Identifier,Seq.Window, mTORProb, mTORPred), "./Output/mTOR_Prediction_List.csv")

6.6 Compare predictions with those in Prediction2016.xlsx

Note that the two datasets, AktFinalPred and Prediction2016, appear to use different gene symbols and sites, although there is an intersection between the two.

# Akt: left-join the 2016 Akt predictions onto our combined predictions,
# matching on both the identifier and the sequence window
CompareFinalPred <- final.pred.combined %>%
  merge(Prediction2016.akt %>%
          dplyr::select(Identifier, `Sequence window`, `Full model predict`) %>%
          mutate(pred.2016.Akt = ifelse(`Full model predict` > 0.5, 1, 0)),
        by.x = c("Identifier", "Seq.Window"),
        by.y = c("Identifier", "Sequence window"),
        all.x = TRUE, all.y = FALSE) %>%
  rename(pred.2016.Akt.prob = `Full model predict`)


# mTOR: same join for the 2016 mTOR predictions
CompareFinalPred <- CompareFinalPred %>%
  merge(Prediction2016.mtor %>%
          dplyr::select(Identifier, `Sequence window`, `Full model predict`) %>%
          mutate(pred.2016.mtor = ifelse(`Full model predict` > 0.5, 1, 0)),
        by.x = c("Identifier", "Seq.Window"),
        by.y = c("Identifier", "Sequence window"),
        all.x = TRUE, all.y = FALSE) %>%
  rename(pred.2016.mTOR.prob = `Full model predict`)

CompareFinalPred$Row.names <- NULL

names(CompareFinalPred)
##  [1] "Identifier"          "Seq.Window"          "AktProb"            
##  [4] "mTORProb"            "AktPred"             "mTORPred"           
##  [7] "pred.2016.Akt.prob"  "pred.2016.Akt"       "pred.2016.mTOR.prob"
## [10] "pred.2016.mtor"
head(CompareFinalPred)
##            Identifier    Seq.Window  AktProb mTORProb AktPred mTORPred
## 1       100043914;88; GRHERRSSAEQHS 0.843375   0.5378       1        0
## 2       100043914;89; RHERRSSAEQHSE 0.697700   0.5126       1        0
## 3   1110018J18RIK;18; CAGRVPSPAGSVT 0.018925   0.0988       0        0
## 4  1110037F02RIK;133; SVDRVISHDRDSP 0.015900   0.0558       0        0
## 5  1110037F02RIK;138; ISHDRDSPPPPPP 0.006050   0.0486       0        0
## 6 1110037F02RIK;1628; FLSEPSSPGRSKT 0.011925   0.0594       0        0
##   pred.2016.Akt.prob pred.2016.Akt pred.2016.mTOR.prob pred.2016.mtor
## 1                 NA            NA                  NA             NA
## 2                 NA            NA                  NA             NA
## 3                 NA            NA                  NA             NA
## 4         0.03973396             0          0.09465660              0
## 5         0.02154160             0          0.08976856              0
## 6         0.02307574             0          0.14314934              0
## CREDIT: https://ragrawal.wordpress.com/2011/05/16/visualizing-confusion-matrix-in-r/
# Code was adapted from URL above.
# Author: Ritesh Agarwal (2011)

# keep only rows that have a matching 2016 Akt prediction
akt.matched <- CompareFinalPred[!is.na(CompareFinalPred$pred.2016.Akt), ]

# compute frequency of actual categories
actual.Akt <- as.data.frame(table(akt.matched$pred.2016.Akt))
names(actual.Akt) <- c("Actual", "ActualFreq")

# compute confusion matrix
confusion.Akt <- as.data.frame(table(akt.matched$pred.2016.Akt, akt.matched$AktPred))
names(confusion.Akt) <- c("Actual", "Predicted", "Freq")

# confusion.Akt

# head(confusion.Akt)
# head(actual.Akt)

#calculate percentage of test cases based on actual frequency
confusion.Akt = merge(confusion.Akt, actual.Akt, by=c("Actual"))
confusion.Akt$Percent = confusion.Akt$Freq/confusion.Akt$ActualFreq*100
 
#render plot
# we use three different layers
# first we draw tiles and fill color based on percentage of test cases
tile.Akt <- ggplot() +
  geom_tile(aes(x = Actual, y = Predicted, fill = Percent),
            data = confusion.Akt, color = "black", size = 0.1) +
  labs(x = "2016 Prediction", y = "SVM Model Prediction") +
  #scale_x_discrete(limits = c(1, 0), position = 'top') +
  theme(axis.text.x.top = element_text(angle = 0, vjust = 0.5, hjust = 0)) +
  ggtitle("Akt Prediction comparison to Prediction2016.xlsx")

# second layer: overlay the counts and row percentages as text
tile.Akt <- tile.Akt +
  geom_text(aes(x = Actual, y = Predicted,
                label = sprintf("%1.f (%.1f%%)", Freq, Percent)),
            data = confusion.Akt, size = 3, colour = "black") +
  scale_fill_gradientn(colours = rev(heat.colors(10)))

# lastly we draw diagonal tiles. We use alpha = 0 so as not to hide previous
# layers but use size = 0.3 to highlight the border
tile.Akt <- tile.Akt +
  geom_tile(aes(x = Actual, y = Predicted),
            data = subset(confusion.Akt, as.character(Actual) == as.character(Predicted)),
            color = "black", size = 0.3, fill = "black", alpha = 0)
 
#render
tile.Akt
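The tile plot above is driven by a long-format frequency table; the same `table()` then `merge()` then percentage pipeline can be traced on toy labels (hypothetical values):

```r
# Hypothetical 2016 labels and model predictions
actual2016.toy <- c(0, 0, 1, 1, 1)
model.toy      <- c(0, 1, 1, 1, 0)

# Long-format confusion counts: one row per (Actual, Predicted) cell
conf.toy <- as.data.frame(table(actual2016.toy, model.toy))
names(conf.toy) <- c("Actual", "Predicted", "Freq")

# Row totals per actual class, used to express each cell as a percentage
tot.toy <- as.data.frame(table(actual2016.toy))
names(tot.toy) <- c("Actual", "ActualFreq")

conf.toy <- merge(conf.toy, tot.toy, by = "Actual")
conf.toy$Percent <- conf.toy$Freq / conf.toy$ActualFreq * 100
conf.toy
```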

From the pseudo confusion matrix above, comparing predictions made by our final model against those of the 2016 study, the negative predictions (0, 0) are closely aligned, while only part of the positively predicted Akt substrates is shared between the two approaches.

## CREDIT: https://ragrawal.wordpress.com/2011/05/16/visualizing-confusion-matrix-in-r/
# Code was adapted from URL above.
# Author: Ritesh Agarwal (2011)

# keep only rows that have a matching 2016 mTOR prediction
mtor.matched <- CompareFinalPred[!is.na(CompareFinalPred$pred.2016.mtor), ]

# compute frequency of actual categories
actual.mTOR <- as.data.frame(table(mtor.matched$pred.2016.mtor))
names(actual.mTOR) <- c("Actual", "ActualFreq")

# compute confusion matrix
confusion.mTOR <- as.data.frame(table(mtor.matched$pred.2016.mtor, mtor.matched$mTORPred))
names(confusion.mTOR) <- c("Actual", "Predicted", "Freq")

# confusion.mTOR

# head(confusion.mTOR)
# head(actual.mTOR)

#calculate percentage of test cases based on actual frequency
confusion.mTOR = merge(confusion.mTOR, actual.mTOR, by=c("Actual"))
confusion.mTOR$Percent = confusion.mTOR$Freq/confusion.mTOR$ActualFreq*100
 
#render plot
# we use three different layers
# first we draw tiles and fill color based on percentage of test cases
tile.mTOR <- ggplot() +
  geom_tile(aes(x = Actual, y = Predicted, fill = Percent),
            data = confusion.mTOR, color = "black", size = 0.1) +
  labs(x = "2016 Prediction", y = "SVM Model Prediction") +
  #scale_x_discrete(limits = c(1, 0), breaks = c("1", "0"), position = 'top') +
  theme(axis.text.x.top = element_text(angle = 0, vjust = 0.5, hjust = 0)) +
  ggtitle("mTOR Prediction comparison to Prediction2016.xlsx")

# second layer: overlay the counts and row percentages as text
tile.mTOR <- tile.mTOR +
  geom_text(aes(x = Actual, y = Predicted,
                label = sprintf("%1.f (%.1f%%)", Freq, Percent)),
            data = confusion.mTOR, size = 3, colour = "black") +
  scale_fill_gradientn(colours = rev(heat.colors(10)))

# lastly we draw diagonal tiles. We use alpha = 0 so as not to hide previous
# layers but use size = 0.3 to highlight the border
tile.mTOR <- tile.mTOR +
  geom_tile(aes(x = Actual, y = Predicted),
            data = subset(confusion.mTOR, as.character(Actual) == as.character(Predicted)),
            color = "black", size = 0.3, fill = "black", alpha = 0)
 
#render
tile.mTOR

7. Conclusion

Based on the experimentation and interpretation of the results, AdaSampling with random forest as the base classifier is expected to perform best for identifying novel Akt and mTOR substrates. This model is selected as the final classifier for kinase-substrate prediction for the following reasons:

  1. Adaptive sampling handles the positively labelled data well.
  2. Once unlabeled positives are accounted for by (1), random forest is robust to class imbalance.
  3. Sensitivity and specificity are higher for all models when AdaSampling is applied than without it.
  4. The RF AdaSample and Prediction 2016 results agree more closely for mTOR substrate predictions than for Akt substrate predictions on positive samples, based on the confusion matrix agreement of (1 (RF AdaSample), 1 (Prediction 2016)).
  5. For negative labelling, the predictions are much more closely aligned (0 (RF AdaSample), 0 (Prediction 2016)).

8. Future Work

To truly understand the causes of the underlying differences in prediction, we must investigate in detail the samples where the predictions diverge. A comparison against the code used to produce the “Prediction_2016.xlsx” results should also be undertaken to identify where the differences arise, including feature generation, feature selection, and the classifier parameters used.